Scheduled checks falling far behind

Andreas Ericsson ae at op5.se
Mon Oct 25 00:02:29 CEST 2010


On 10/24/2010 10:14 PM, Litwin, Matthew wrote:
> Hi Matthieu (and anyone else who might want to throw their hat into
> the ring):
> 

I'll chip in. Your MUA seems to not wrap lines at all though, which
makes replying inline a bit tricky.

Note that you should wipe your status.sav files between restarts so
that old latency values don't skew the numbers you're seeing.

What system are you running this on? Nagios has been known to have
issues with older non-Linux systems where thread libraries aren't
as forgiving as the NPTL library shipped with glibc. Also, Nagios
should never run as a virtual guest.

As for the check_result_reaper_frequency things, we ship those unset
so they take the Nagios defaults. We used to have it at 2. I'm unsure
if removing the setting was a conscious choice or just by accident.

In general, you should keep your performance-data and checkresult
files on ramdisks. That will help prevent I/O from becoming a
bottleneck.


> So after identifying that I have latency times that are around
> 500-600 seconds I have tried the tuning tips from the nagios docs.
> However, while latency drops briefly after the restart, it then
> just comes back up to the high levels again. At this point I have
> only been working with check_result_reaper_frequency and
> max_check_result_reaper_time by doubling and halving them from their
> default values. max_concurrent_checks remains at 0. Load on the
> server is very low. The machine is an 8-core machine so I really wish
> I could make better use of it. Load is a measly 1.5 on average.
> Finally, I tried enable_environment_macros = 0 which actually made it
> worse, once things quiesced after startup.
> use_large_installation_tweaks=1 did improve the latency by maybe 30%
> and I did actually start seeing RRD data come in solid for about 15
> minutes but then it returned to being sparse again, so while it is a
> modest improvement, it still doesn't fill in enough RRD data to be useful.
> 
> Any other tuning suggestions? I think I have done everything in the
> performance tweaks section that seems relevant, including all of
> those that have been suggested here.
> 

Make sure you haven't got "parallelize_check" set to 0 anywhere. That
will make Nagios try to run the checks one at a time, which obviously
doesn't work too well. If that's the case, you should see a latency
that corresponds to the number of checks you're running times the
average check execution time, minus the normal check interval.

In other words, if you've got 900 checks in total, the average check
execution time is 1 second and you plan to run all checks in a 5 minute
interval (300 secs), you should get a latency of roughly 600 seconds.
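
The arithmetic above can be sketched in a few lines of Python. This is
only a rough illustration of the estimate, not anything Nagios computes
itself. The function name is made up for this example, and clamping the
result at zero is an added assumption (no backlog means no latency).

```python
def estimated_serial_latency(num_checks, avg_exec_seconds, check_interval_seconds):
    """Rough latency estimate when checks run strictly one at a time."""
    # Time needed to run every check back to back.
    total_runtime = num_checks * avg_exec_seconds
    # Whatever does not fit into one check interval shows up as latency.
    # Assumption: clamp at zero when everything fits into the interval.
    return max(0.0, total_runtime - check_interval_seconds)


# The example from the text: 900 checks, 1 second each, 300 second interval.
print(estimated_serial_latency(900, 1.0, 300))
```

With the numbers from the example this gives 900 * 1 - 300 = 600 seconds.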

If you've got it set for only a few checks, Nagios will still fail to
run any other checks while an unparallelizable check runs. It doesn't
take that into account when it schedules checks, so such a check can
end up scheduled at the same time as other checks, and latency will
always be a bit higher when not all checks are run in parallel.
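
One way to look for stray "parallelize_check 0" settings is to scan the
object config text for them. The following is a minimal sketch, assuming
the usual "define <type> { ... }" object syntax with ";" and "#"
comments. The function name is hypothetical, and the sketch does not
follow cfg_file/cfg_dir includes or resolve template inheritance.

```python
import re

# Matches the start of an object definition, e.g. "define service{".
DEFINE_RE = re.compile(r"^\s*define\s+(\w+)\s*\{")
# Matches a directive line that disables parallelization.
DIRECTIVE_RE = re.compile(r"^\s*parallelize_check\s+0\b")


def find_serial_checks(config_text):
    """Return (line_number, object_type) for each 'parallelize_check 0'."""
    hits = []
    current_type = None
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        line = line.split(";", 1)[0]  # drop trailing ';' comments
        if line.lstrip().startswith("#"):
            continue  # skip whole-line '#' comments
        match = DEFINE_RE.match(line)
        if match:
            current_type = match.group(1)
        elif line.strip() == "}":
            current_type = None
        elif DIRECTIVE_RE.match(line):
            hits.append((lineno, current_type))
    return hits
```

Feeding it the contents of each object config file and printing the
returned line numbers should be enough to spot the offending definitions.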

> In summary, I am looking for some way to make nagios "do more" with
> the system resources as the host is barely working at all. I really
> wish there was some way to just make nagios to have some ability to
> do things more in parallel for cases where a system has plenty of
> horsepower and RAM. If I have to resort to compiling things with
> different settings I would be open to trying it, but I just feel like
> I am grasping at straws now.
> 

Are you using any eventbroker modules? If so, which ones and what
happens when you disable them?

What happens when you disable performance-data parsing and writing?

Is the system running as a virtual guest?

Do you have any checks with a check_interval that differs wildly
from the average check_interval? A while back there was a bug
that caused Nagios to spread the first service-check in a window
as big as the largest check_interval. Once all checks had been
executed, latency slowly normalized again. This doesn't seem to
match what you're describing, but it could be a similar bug
somewhere else. Using the same check_interval for all hosts and
services should tell if that's the case.
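
To get a quick idea of whether any check_interval differs wildly from
the rest, something like the sketch below could be run over the object
config text. It is an illustration only: the function name and the
factor-of-4 threshold are arbitrary assumptions, it compares against the
median rather than the mean, and it reads only literal check_interval
(or normal_check_interval) lines without resolving templates.

```python
import re
from collections import Counter
from statistics import median

# Matches "check_interval N" or the older "normal_check_interval N".
# "retry_check_interval" is not matched because of the line-start anchor.
INTERVAL_RE = re.compile(r"^\s*(?:normal_)?check_interval\s+(\d+(?:\.\d+)?)")


def interval_outliers(config_text, factor=4.0):
    """Return {interval: count} for intervals far from the median."""
    values = []
    for line in config_text.splitlines():
        match = INTERVAL_RE.match(line.split(";", 1)[0])
        if match:
            values.append(float(match.group(1)))
    if not values:
        return {}
    mid = median(values)
    counts = Counter(values)
    # Flag anything more than `factor` times above or below the median.
    return {
        value: count
        for value, count in counts.items()
        if value > mid * factor or value < mid / factor
    }
```

An empty result suggests the intervals are reasonably uniform. Anything
it does report is a candidate for the temporary "same check_interval
everywhere" experiment described above.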

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
