Completely stumped

Andreas Ericsson ae at op5.se
Thu Jan 18 17:47:27 CET 2007

Tobias Klausmann wrote:
> Hi!
> 
> The other day, we got our beefier machine. I had hoped my latency
> problems (ever increasing check latencies) would go away or at
> least turn irrelevant with that. They didn't.
> 
> More precisely: we have migrated to a four-core Opteron 2.2GHz
> with 2GBs of RAM and a quite fast I/O Subsystem. 
> 
> We have 331 / 2940 hosts/services which are all checked actively.
> 
> Still, after less than an hour, our check latency skyrockets well
> over 120s. Unacceptable.
> 

This sounds truly bizarre. Could you send me the nagios binary you're
using along with all your configuration, as well as the status.sav file?

Make sure to remove any passwords and stuff in the configuration before
you bzip2 it up and send it.


> I've tested a whole slew of stuff in order to find out what the
> hell is wrong. I've played with concurrency settings and just
> about any performance tip save distributing the setup.
> 
> Nothing worked.
> 
> Not a single metric on the machine itself (interrupt rate,
> context switches or anything else the *stat utilities show me)
> tells me it's the machine's fault.
> 
> I'm out of ideas (and to be frank, a bit desperate).
> 
> What the hell can I do?
> 
> The *only* thing I've left to try is removing the multiuser patch
> we talked about at the end of last year. If that does it, at
> least I have an idea *where* in the code my problem lies. I'll
> try that route tonight.
> 

Which patch was this? I couldn't find it in the December archives.

For now though, try lowering your reaper frequency
(service_reaper_frequency in nagios.cfg) to 2 (lowest allowed value)
and see what happens.

When you get desperate, set all your services to be checked with the
exact same interval settings (5 minutes normal_check_interval, 1 minute
retry_check_interval or something like that).

---%<---%<---%<---
There has been a slight thinko in Nagios. I don't know if it's still
there in recent CVS versions. The thinko is that it calculates (or used
to calculate) the average service check interval by adding up all
normal_check_interval values and dividing the sum by the number of
services configured (or something along those lines), which leads to
long latencies. This normally didn't make those latencies keep
increasing, though. Humm...
---%<---%<---%<---
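
To make the averaging described above concrete, here is a rough sketch in
Python. It is not Nagios code: the function names, the service counts and
the interval values are all invented for illustration, and the "even
spread" spacing is only an assumption about how such an average might be
used.

```python
# Illustrative sketch only. The service counts and interval values are made
# up, and the formulas are a simplified reading of the description above,
# not code taken from Nagios.

def average_check_interval(intervals_min):
    """Mean of the configured normal_check_interval values, in minutes."""
    return sum(intervals_min) / len(intervals_min)


def even_spread_delay_s(intervals_min):
    """Seconds between consecutive checks if all services are spread evenly
    over one *average* interval (an assumption made for this sketch)."""
    return average_check_interval(intervals_min) * 60 / len(intervals_min)


if __name__ == "__main__":
    uniform = [5] * 2940               # every service checked every 5 minutes
    mixed = [5] * 2900 + [1440] * 40   # 40 hypothetical once-a-day checks

    for label, intervals in (("uniform", uniform), ("mixed", mixed)):
        avg = average_check_interval(intervals)
        gap = even_spread_delay_s(intervals)
        print(f"{label}: average interval {avg:.2f} min, spacing {gap:.3f} s")
```

With these made-up numbers, 40 daily checks (about 1.4% of the services)
pull the mean from 5 minutes to roughly 24.5 minutes, and the evenly
spread spacing from about 0.10 s to about 0.50 s. Giving every service
the same interval settings removes that skew.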

If that one fails, I think only some long-running monitoring of both
the server and Nagios (using printf()-debugging) could get to the
bottom of this.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
