Scheduled checks falling far behind

Litwin, Matthew mlitwin at stubhub.com
Mon Oct 25 04:16:20 CEST 2010


It would appear that running nagiosgraph in immediate mode was the cause of the latency. However, batch mode has problems of its own that drop big chunks of data, so that won't work either. It looks like I will need to find another solution, sadly, especially since I have invested so much time in trying to solve this. :-C

On Oct 24, 2010, at 4:19 PM, Litwin, Matthew wrote:

> 
> On Oct 24, 2010, at 3:02 PM, Andreas Ericsson wrote:
> 
>> On 10/24/2010 10:14 PM, Litwin, Matthew wrote:
>>> Hi Matthieu (and anyone else who might want to throw their hat into
>>> the ring):
>>> 
>> 
>> I'll chip in. Your MUA seems to not wrap lines at all though, which
>> makes replying inline a bit tricky.
> 
> Sorry. Blame Apple. :-)
> 
>> 
>> Note that you should wipe your status.sav files between restarts to
>> not let old latency affect the numbers you're seeing.
> 
> I don't seem to have them on my system.
>> 
>> What system are you running this on? Nagios has been known to have
>> issues with older non-linux systems where thread libraries aren't
>> as forgiving as the nptl library shipped with glibc. Also, Nagios
>> should never run as a virtual guest.
> 
> It is an 8-core x86 server running CentOS 5.3
> 
>> As for the check_result_reaper_frequency things, we ship those unset
>> so they take the Nagios defaults. We used to have it at 2. I'm unsure
>> if removing the setting was a conscious choice or just by accident.
> 
> I will give it a try, thanks.
>> 
>> In general, you should keep your performance-data and checkresult
>> files on ramdisks. That will help prevent IO from becoming a
>> bottleneck.
> 
> IO wait on the server is on average 1%, so I doubt that is the problem, but it is certainly worth investigating.
>> 
>> 
>>> So after identifying that I have latency times of around
>>> 500-600 seconds, I have tried the tuning tips from the Nagios docs.
>>> However, while latency drops briefly after a restart, it then just
>>> comes back up to the high levels again. At this point I have only
>>> been working with check_result_reaper_frequency and
>>> max_check_result_reaper_time by doubling and halving them from their
>>> default values. max_concurrent_checks remains at 0. Load on the
>>> server is very low. The machine is an 8-core machine so I really wish
>>> I could make better use of it. Load is a measly 1.5 on average.
>>> Finally, I tried enable_environment_macros = 0 which actually made it
>>> worse, once things quiesced after startup.
>>> use_large_installation_tweaks=1 did improve the latency by maybe 30%
>>> and I did actually start seeing RRD data come in solid for about 15
>>> minutes, but then it returned to being sparse again. So while it is a
>>> modest improvement, the RRD data is still too sparse to be useful.
>>> 
>>> Any other tuning suggestions? I think I have done everything in the
>>> performance tweaks section that seems relevant, including all of
>>> those that have been suggested here.
>>> 
>> 
>> Make sure you haven't got "parallelize_check" set to 0 anywhere. That
>> will make Nagios try to run the checks one at a time, which obviously
>> doesn't work too well. If that's the case, you should have a latency
>> that corresponds to the number of checks you're running times the
>> average check execution time minus the normal check-interval.
>> 
>> In other words, if you've got 900 checks in total, the average check
>> execution time is 1 second and you plan to run all checks in a 5 minute
>> interval (300 secs), you should get a latency of roughly 600 seconds.
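
The estimate above can be written out as a small calculation. This is only a sketch of the rule of thumb described in the message, not something Nagios itself computes. The function name and its parameters are made up for illustration, and clamping the result at zero is an added assumption.

```python
def estimated_serial_latency(num_checks, avg_exec_seconds, interval_seconds):
    """Rough latency when every check runs one at a time.

    Rule of thumb from the message: total serial run time minus the
    check interval. Clamping at zero is an assumption added here.
    """
    total_run_time = num_checks * avg_exec_seconds
    return max(0.0, total_run_time - interval_seconds)


# The example from the text: 900 checks, 1 second each, 5 minute interval.
print(estimated_serial_latency(900, 1.0, 300))  # 600.0
```

If the observed latency sits close to this figure, serialized checks are a plausible suspect. If it is far off, the cause is more likely elsewhere.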
>> 
>> If you've got it set for a few checks, Nagios will still fail to run
>> any other checks during the time the unparallelizeable check runs,
>> but it doesn't check if such checks are scheduled at the same time as
>> other checks when it schedules them, so latency will always be a bit
>> higher when not all checks are run in parallel.
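
To check whether any object definition sets parallelize_check to 0, as suggested above, the configuration text can be scanned for that directive. The following is a minimal sketch and not part of Nagios: the helper name is hypothetical, the sample definitions are invented, and it assumes the directive and its value sit on one line, with whole-line comments starting with `#` or `;`.

```python
import re

# Matches a line such as "parallelize_check 0" inside an object definition.
# Whole-line comments are skipped before this pattern is applied.
_PATTERN = re.compile(r"^\s*parallelize_check\s+0\b")


def find_serialized_checks(config_text):
    """Return 1-based line numbers where parallelize_check is set to 0."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        if line.lstrip().startswith(("#", ";")):
            continue
        if _PATTERN.match(line):
            hits.append(lineno)
    return hits


# Invented sample, for illustration only.
sample = """define service{
    service_description  Disk
    parallelize_check    0
    }
define service{
    service_description  Load
    parallelize_check    1
    }
"""
print(find_serialized_checks(sample))  # [3]
```

In practice the text would be read from each object configuration file in turn. Which files those are depends on the local setup.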
>> 
>>> In summary, I am looking for some way to make Nagios "do more" with
>>> the system resources, as the host is barely working at all. I really
>>> wish there was some way to make Nagios do more things in parallel
>>> in cases where a system has plenty of horsepower and RAM. If I have
>>> to resort to compiling things with different settings I would be
>>> open to trying it, but I just feel like I am grasping at straws now.
>>> 
>> 
>> Are you using any eventbroker modules? If so, which ones and what
>> happens when you disable them?
> 
> Not that I know of.
>> 
>> What happens when you disable performance-data parsing and writing?
> 
> Actually, that is what I am trying to get working properly. My RRD data files are sparse as a result.
> 
>> 
>> Is the system running as a virtual guest?
> 
> No, it is a physical server.
> 
>> 
>> Do you have any checks with a check_interval that differs wildly
>> from the average check_interval?
> 
> All of my check_interval settings are 5 with a few that are a little bit less.
> 
> I am running 3.2.1
> 
> The documentation suggests I set the check_interval for hosts to 0. Is that appropriate?
> 
>> A while back there was a bug
>> that caused Nagios to spread the first service-check in a window
>> as big as the largest check_interval. Once all checks had been
>> executed, latency slowly normalized again. This doesn't seem to
>> match what you're describing, but it could be a similar bug
>> somewhere else. Using the same check_interval for all hosts and
>> services should tell if that's the case.
>> 
>> -- 
>> Andreas Ericsson                   andreas.ericsson at op5.se
>> OP5 AB                             www.op5.se
>> Tel: +46 8-230225                  Fax: +46 8-230231
>> 
>> Considering the successes of the wars on alcohol, poverty, drugs and
>> terror, I think we should give some serious thought to declaring war
>> on peace.
> 
> 
> ------------------------------------------------------------------------------
> Nokia and AT&T present the 2010 Calling All Innovators-North America contest
> Create new apps & games for the Nokia N8 for consumers in  U.S. and Canada
> $10 million total in prizes - $4M cash, 500 devices, nearly $6M in marketing
> Develop with Nokia Qt SDK, Web Runtime, or Java and Publish to Ovi Store 
> http://p.sf.net/sfu/nokia-dev2dev
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null

