Performance issues, too

Andreas Ericsson ae at op5.se
Tue Dec 19 13:09:37 CET 2006


Tobias Klausmann wrote:
> Hi! 
> 
> On Tue, 19 Dec 2006, Andreas Ericsson wrote:
>> Thanks for an excellently detailed problem report, missing only the 
>> Nagios version and system type/version info. I've got some comments and 
>> followup questions. See below.
> 
> I'm running 2.6 now but I had the troubles with 2.5 initially.
> OS is a Gentoo Linux, Kernel 2.6.15.5 initially, upgrade to
> 2.6.19 today.
> 
>>> ---------------------------
>>> Total hosts:                     330
>>> Total scheduled hosts:           0
>> No scheduled host-checks. That's good, cause they interfere with normal 
>> operations in Nagios.
> 
> I've read as much. In my separate mail I had a few questions
> about it, let's keep them (and the answers there ;)
> 
>>> Host inter-check delay method:   SMART
>>> Average host check interval:     0.00 sec
>>> Host inter-check delay:          0.00 sec
>>> Max host check spread:           10 min
>>> First scheduled check:           N/A
>>> Last scheduled check:            N/A
>>>
>>>
>>> SERVICE SCHEDULING INFORMATION
>>> -------------------------------
>>> Total services:                     2836
>>> Total scheduled services:           2836
>>> Service inter-check delay method:   SMART
>>> Average service check interval:     2225.56 sec
>> This is, as you point out below, quite odd. What's your _longest_ 
>> normal_check_interval for services?
> 
> The longest check_interval is 86400 seconds. It's an SSL cert
> freshness check. I figured it wasn't necessary to check that
> more often than once a day. I also have check_intervals of 3, 5,
> 15, 20, 30 and 1440 seconds. The latter is also a cert freshness
> check which is lower because the customer wanted it to be that
> short.
> 

Try changing the really long intervals to something shorter or 
commenting them out completely and see what happens. Checking a 
certificate is not a particularly heavy operation so it doesn't matter 
much if you run it every 5 minutes. On the server side it just gets 
handed out from cache, so it's not heavy there either.

If you have the various normal_check_interval values specified in 
templates, try setting them all to 5 minutes and let Nagios run 
overnight. If this interferes with some fragile services on the network 
(e.g. webservers whose sessions don't expire), disable active checks for 
those services during the testing period.

(yes, this might seem braindead, but I really need to know if this bug 
is still in Nagios).
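
To show why the long intervals are the first suspect: if the average 
interval really is computed as a plain mean over all services (that is 
my recollection of the old behaviour, not something I have checked in 
the current source), a handful of long-interval checks is enough to drag 
it far away from what most services use. A rough Python sketch; the 
service counts are made up for illustration and are not taken from your 
configuration:

```python
from statistics import mean, median

def average_interval(intervals):
    """Plain arithmetic mean of per-service check intervals.

    Assumption: this mirrors the suspected averaging, not verified
    against the Nagios source.
    """
    return mean(intervals)

# Hypothetical mix: 2800 services at interval 5 plus 36 long-running
# certificate checks at interval 1440 (same units for both).
intervals = [5] * 2800 + [1440] * 36

print("mean:  ", average_interval(intervals))
print("median:", median(intervals))
```

The median stays at the interval most services actually use, while the 
mean is pulled upwards by the few long ones.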

> 
>>> *Or* it is indicative of a misconfiguration on my
>>> part. If the latter is the case, I'd be eager, nay ecstatic to
>>> hear what I did wrong. Here are a few of the config vars that
>>> might influence this:
>> There has been a slight thinko in Nagios. I don't know if it's still 
>> there in recent CVS versions. The thinko is that it (used to?) calculate 
>> average service check interval by adding up all normal_check_interval 
>> values and dividing it by the number of services configured (or 
>> something along those lines), which leads to long latencies. This 
>> normally didn't make those latencies increase though. Humm...
> 
> Well, the numbers sure do get whacky after a restart: first it
> skyrockets for about five minutes, then plummets to 1s. From
> there it works its way up the way I described.
> 

Are the first checks of things being scheduled with unreasonably long 
delays? For example, a check with a 3-minute normal_check_interval being 
scheduled an hour or so into the future.
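
If you can get the next-check times out of your installation after a 
restart, a small filter like the one below would spot such cases. This 
is only a sketch: the tuple layout, the service names and the slack 
factor are my own assumptions, and how you extract the data is up to 
you.

```python
def late_first_checks(services, start_time, slack=2.0):
    """Return names of services whose first check is scheduled more than
    slack * interval seconds after start_time.

    services: iterable of (name, interval_sec, next_check_epoch) tuples.
    This layout is hypothetical, not a Nagios file format.
    """
    flagged = []
    for name, interval_sec, next_check in services:
        if next_check - start_time > slack * interval_sec:
            flagged.append(name)
    return flagged

start = 1_000_000  # hypothetical restart time, epoch seconds
sample = [
    ("web1/HTTP", 180, start + 120),   # first check within one interval
    ("web1/CERT", 180, start + 3600),  # 3-minute check scheduled an hour out
]
print(late_first_checks(sample, start))
```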


>>> Total Services:                       2836
>>> Services Checked:                     2836
>>> Services Scheduled:                   2758
>>> Active Service Checks:                2836
>>> Passive Service Checks:               0
>> All services aren't being scheduled, but you have no passive service 
>> checks. Have you disabled checks of 78 services?
> 
> Oops, forgot to mention that. Yes, a server farm is being rebuilt
> currently. As I didn't want all the host check timeouts to make
> matters much, much, worse, I disabled them entirely.
> 

Ah, that explains it then. It shouldn't matter, but unless the 
experiment I suggested above turns up anything useful, would you mind 
commenting them out and testing that?

>>> Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface.
>>> LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both
>>> around 40% idle most of the time. I see about 300 context
>>> switches and 500 interrupts per second. The network load is
>>> negligible, ditto the packet rate.
>>>
>>> The way these figures look I don't see a performance problem per
>>> se, but maybe I have overlooked a metric that describes the
>>> "usual" bottleneck of installations.
>>>
>> Are the CPUs 64-bit ones running in 32-bit emulation mode? For Intel 
>> CPUs, that causes up to 60% performance loss (yes, it really is that bad).
> 
> Sheesh. Yes, it is a 32-bit installation. I only ever bothered
> with 64-bit installs on Opteron hardware. I might look into
> migrating to 64 bits, then.
> 

So the CPUs are 64-bit? Humm... 64-bit mode would boost available 
resources quite a bit, but as you just enabled HT you should now have 3 
extra CPUs (Xeons are dual-core AFAIR) which will probably keep you safe 
for a while.
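
To settle the 64-bit question: on Linux, an x86 CPU that supports 64-bit 
long mode lists the "lm" flag in /proc/cpuinfo. A minimal Python sketch 
that checks for it (the function name is my own; it only looks at the 
first "flags" line and assumes an x86 box):

```python
def cpu_is_64bit_capable(cpuinfo_text):
    """Return True if the first 'flags' line in /proc/cpuinfo-style text
    contains the x86 'lm' (long mode) flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return "lm" in value.split()
    return False

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            print(cpu_is_64bit_capable(fh.read()))
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```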

>> I'm puzzled. Please let me know if you find the answer to this problem. 
>> I'll help you debug it as best I can, but please continue posting 
>> on-list. Thanks.
> 
> Sure. I'll first check if the "processor upgrade" and kernel
> update helped anything, then try lowering the reaper interval to
> 2. I'll post the results as soon as I have them.
> 

It might help with the slowly creeping latencies. If the experiment 
above doesn't yield anything useful, try installing a 64-bit userland 
and recompiling Nagios and the plugins (and perl, and /bin/sh, wherever 
it points to) with a 64-bit compiler. It should quell any remaining 
resource starvation and let load average drop to around 0.5 - 1.0.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
