Performance issues, too

Tobias Klausmann klausman at schwarzvogel.de
Tue Dec 19 13:47:16 CET 2006


Hi! 

On Tue, 19 Dec 2006, Andreas Ericsson wrote:
> >>> -------------------------------
> >>> Total services:                     2836
> >>> Total scheduled services:           2836
> >>> Service inter-check delay method:   SMART
> >>> Average service check interval:     2225.56 sec
> >> This is, as you point out below, quite odd. What's your _longest_ 
> >> normal_check_interval for services?
> > 
> > The longest check_interval is 86400 seconds. It's an SSL cert
> > freshness check. I figured it wasn't necessary to check that
> > more often than once a day. I also have check_intervals of 3, 5,
> > 15, 20, 30 and 1440 seconds. The latter is also a cert freshness
> > check which is lower because the customer wanted it to be that
> > short.
> 
> Try changing the really long intervals to something shorter or 
> commenting them out completely and see what happens. Checking a 
> certificate is not a particularly heavy operation so it doesn't matter 
> much if you run it every 5 minutes. On the server side it just gets 
> handed out from cache, so it's not heavy there either.
> 
> If you have the various normal_check_interval values specified in templates, 
> try setting them all to 5 minutes and let Nagios run over-night. If this 
> interferes with some fragile services on the network (e.g. webservers 
> whose sessions don't expire), disable active checks for those services 
> during the testing period.
> 
> (yes, this might seem braindead, but I really need to know if this bug 
> is still in Nagios).

I'll do that this afternoon, I'd just like to wait a little more
regarding the changes my kernel/cpu-update brings (or doesn't).
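
For the record: since the intervals live in our service templates,
flattening them is a quick change. Roughly like this (template name
and values illustrative, not our exact config):

    define service{
            name                    generic-service ; our base template
            normal_check_interval   5               ; flattened for the test
            retry_check_interval    1
            register                0               ; template only
            }

For the fragile services you mention I'd rather set
active_checks_enabled 0 on the individual definitions than touch
the template.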

> >>> *Or* it is indicative of a misconfiguration on my
> >>> part. If the latter is the case, I'd be eager, nay ecstatic to
> >>> hear what I did wrong. Here are a few of the config vars that
> >>> might influence this:
> >> There has been a slight thinko in Nagios. I don't know if it's still 
> >> there in recent CVS versions. The thinko is that it (used to?) calculate 
> >> average service check interval by adding up all normal_check_interval 
> >> values and dividing it by the number of services configured (or 
> >> something along those lines), which leads to long latencies. This 
> >> normally didn't make those latencies increase though. Humm...
> > 
> > Well, the numbers sure do get whacky after a restart: first it
> > skyrockets for about five minutes, then plummets to 1s. From
> > there it works its way up the way I described.
> 
> Are the first checks of things being scheduled with unreasonably long 
> delays? E.g., a check with a 3-minute normal_check_interval being scheduled 
> an hour or so into the future.

Usually, yes. As I use state retention, I don't believe in the
initial numbers all that much. After about 5-10 minutes one can
usually make out a trend. Not this time, though. Here's hoping
that it keeps oscillating around the 8-9 seconds I currently see.
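
Just to spell out why your theory would fit my numbers: if the
reported average really is sum(normal_check_interval) divided by
the number of services, then a handful of day-long intervals
dominates it. A made-up illustration (not my actual config):

    100 services @ 300 sec  +  4 services @ 86400 sec
    => (100*300 + 4*86400) / 104  =  375600 / 104  ~=  3612 sec

Four slow checks drag the "average interval" an order of magnitude
above what most services actually use, and the SMART spreading
would then space the initial checks accordingly.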

> >>> Total Services:                       2836
> >>> Services Checked:                     2836
> >>> Services Scheduled:                   2758
> >>> Active Service Checks:                2836
> >>> Passive Service Checks:               0
> >> All services aren't being scheduled, but you have no passive service 
> >> checks. Have you disabled checks of 78 services?
> > 
> > Oops, forgot to mention that. Yes, a server farm is being rebuilt
> > currently. As I didn't want all the host check timeouts to make
> > matters much, much worse, I disabled them entirely.
> 
> Ah, that explains it then. It shouldn't matter, but unless the 
> experiment I suggested above turns up anything useful, would you mind 
> commenting them out and testing that?

I was planning to do that tomorrow for the very same reasons.

> >>> Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface.
> >>> LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both
> >>> around 40% idle most of the time. I see about 300 context
> >>> switches and 500 interrupts per second. The network load is
> >>> negligible, ditto the packet rate.
> >>>
> >>> The way these figures look, I don't see a performance problem per
> >>> se, but maybe I have overlooked a metric that describes the
> >>> "usual" bottleneck of installations.
> >>>
> >> Are the CPUs 64-bit ones running in 32-bit emulation mode? For Intel 
> >> CPUs, that causes up to 60% performance loss (yes, it really is that bad).
> > 
> > Sheesh. Yes, it is a 32-bit installation. I only ever bothered
> > with 64-bit installs on Opteron hardware. I might look into
> > migrating to 64 bits, then.
> > 
> 
> So the CPUs are 64-bit? Humm... 64-bit mode would boost available 
> resources quite a bit, but as you just enabled HT you should now have 3 
> extra CPUs (Xeons are dual-core AFAIR), which will probably keep you safe 
> for a while.

A colleague just told me that this particular batch wasn't
available in 64 bits. So no, they're 32-bit; well, one thing to
test out of the way :-/
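
In case anyone wants to double-check their own boxes: on Linux, the
"lm" (long mode) flag in /proc/cpuinfo tells you whether the CPU can
do 64-bit at all, independent of what the kernel is running:

    $ grep -qw lm /proc/cpuinfo && echo "64-bit capable" || echo "32-bit only"
    $ uname -m   # i686 = running 32-bit, x86_64 = running 64-bit

Would have saved me asking around.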

> >> I'm puzzled. Please let me know if you find the answer to this problem. 
> >> I'll help you debug it as best I can, but please continue posting 
> >> on-list. Thanks.
> > 
> > Sure. I'll first check if the "processor upgrade" and kernel
> > update helped anything, then try lowering the reaper interval to
> > 2. I'll post the results as soon as I have them.
> 
> It might help with the slowly creeping latencies. If the experiment 
> above doesn't yield anything useful, try installing a 64-bit userland 
> and recompiling Nagios and the plugins (and perl, and /bin/sh, wherever 
> it points to) with a 64-bit compiler. It should quell any remaining 
> resource starvation and let load average drop to around 0.5 - 1.0.
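
For the reaper experiment: if memory serves, the knob is
service_reaper_frequency in nagios.cfg (default 10 seconds on 2.x):

    # nagios.cfg - how often Nagios collects finished check results
    service_reaper_frequency=2

Lower values mean check results get picked up sooner, at the cost of
a little extra overhead.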

Well, I already convinced my boss that we need beefier machines
anyway. We currently have no spare machine, and all our current
machines are too incompatible to act as spares. If it tells you
anything: we're on a single HP DL360G3 now and will be moving to
two DL360G5s or two DL365s early next year.

Still, we need to switch to a distributed setup - another
headache I'd rather put off till next year.

Regards,
Tobias
-- 
Never touch a burning system.
