Scheduled checks falling far behind

Andreas Ericsson ae at op5.se
Mon Oct 25 09:37:17 CEST 2010


On 10/25/2010 01:19 AM, Litwin, Matthew wrote:
> On Oct 24, 2010, at 3:02 PM, Andreas Ericsson wrote:
>>
>> Note that you should wipe your status.sav files between restarts to
>> not let old latency affect the numbers you're seeing.
> 
> I don't seem to have them on my system.

Perhaps you haven't enabled state retention then. If you have, there
should be something of the kind in nagios' /var directory. grep your
nagios.cfg for retent and see what comes up.

>>
>> What system are you running this on? Nagios has been known to have
>> issues with older non-linux systems where thread libraries aren't
>> as forgiving as the nptl library shipped with glibc. Also, Nagios
>> should never run as a virtual guest.
> 
> It is an 8 core x86 server running CentOS 5.3
> 

Not virtual and not what you'd call underdimensioned for the task, then.

>>
>> In general, you should keep your performance-data and checkresult
>> files on ramdisks. That will help prevent IO from becoming a
>> bottleneck.
> 
> IO wait on the server is on average 1% so I doubt that is the
> problem, but certainly worth investigating.

That won't be it then.

>>
>>
>>> So after identifying that I have latency times that are around
>>> 500-600 seconds I have tried the tuning tips from the nagios docs.
>>> However, although latency drops briefly after a restart, it then
>>> climbs back up to the high levels again. At
>>> this point I have only been working with check_reaper_frequency and
>>> max_check_result_reaper_time by doubling and halving them from their
>>> default values. max_concurrent_checks remains at 0. Load on the
>>> server is very low. The machine is an 8 core machine so I really wish
>>> I could make better use of it. Load is a measly 1.5 on average.
>>> Finally, I tried enable_environment_macros = 0 which actually made it
>>> worse, once things quiesced after startup.
>>> use_large_installation_tweaks=1 did improve the latency by maybe 30%
>>> and I did actually start seeing RRD data come in solid for about 15
>>> minutes but then it returned to being sparse again, so while it is a
>>> modest improvement, the RRD data is still too sparse to be useful.
>>>
>>> Any other tuning suggestions? I think I have done everything in the
>>> performance tweaks section that seems relevant, including all of
>>> those that have been suggested here.
>>>
>>
>> Make sure you haven't got "parallelize_check" set to 0 anywhere. That
>> will make Nagios try to run the checks one at a time, which obviously
>> doesn't work too well. If that's the case, you should have a latency
>> that corresponds to the amount of checks you're running times the
>> average check execution time minus the normal check-interval.
>>

Since you didn't respond to this, I'll just assume you haven't got it
set to 0 for any host or service.
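
To put rough numbers on that rule of thumb, here is a minimal sketch in
Python. The function name and the figures (1800 checks, 0.5 seconds
average execution time, a 5 minute interval) are made up for
illustration and are not taken from your setup.

```python
def serial_latency_estimate(num_checks, avg_exec_seconds, check_interval_seconds):
    """Rough latency (in seconds) if every check runs one at a time.

    Implements the rule of thumb quoted above: number of checks times
    average execution time, minus the normal check interval. A negative
    result means the checks fit inside the interval, so it is clamped to 0.
    """
    backlog = num_checks * avg_exec_seconds - check_interval_seconds
    return max(0.0, backlog)


# Hypothetical figures: 1800 checks, 0.5 s each, 300 s check interval.
print(serial_latency_estimate(1800, 0.5, 300))  # 600.0
```

If the latency you measure is in that ballpark for your own numbers,
serialized checks would be the first thing to suspect.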

>>
>>> In summary, I am looking for some way to make nagios "do more" with
>>> the system resources as the host is barely working at all. I really
>>> wish there was some way to just make nagios
>>> do things more in parallel for cases where a system has plenty of
>>> horsepower and RAM. If I have to resort to compiling things with
>>> different settings I would be open to trying it, but I just feel like
>>> I am grasping at straws now.
>>>
>>
>> Are you using any eventbroker modules? If so, which ones and what
>> happens when you disable them?
> 
> Not that I know of.

grep broker /path/to/nagios.cfg

will tell you.
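
Note that a plain grep will also match commented-out lines and
event_broker_options, which is typically set even when no module is
loaded. A minimal sketch in Python that lists only the active
broker_module lines, assuming the usual key=value layout of nagios.cfg
with '#' comments. The function name and the sample paths are
hypothetical.

```python
def active_broker_modules(cfg_text):
    """Return the values of all uncommented broker_module= lines."""
    modules = []
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only; module arguments may contain '='.
        key, sep, value = line.partition("=")
        if sep and key.strip() == "broker_module":
            modules.append(value.strip())
    return modules


# Hypothetical config excerpt, not taken from a real installation.
sample = """\
# broker_module=/hypothetical/path/commented_out.o
event_broker_options=-1
broker_module=/hypothetical/path/example.o config_file=/hypothetical/example.cfg
"""

print(active_broker_modules(sample))
```

An empty list means no eventbroker module is being loaded from that
file.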

>>
>> What happens when you disable performance-data parsing and writing?
> 
> Actually, that is what I am trying to get working properly. My RRD
> data files are sparse as a result.

Even so, try disabling it for a bit and see if the way performance
data is handled is causing problems. What performance-data gathering
solution are you using?

>>
>> Do you have any checks with a check_interval that differs wildly
>> from the average check_interval?
> 
> All of my check_interval settings are 5 with a few that are a little bit less.
> 
> I am running 3.2.1
> 
> The documentation suggests I set the check_interval for hosts to 0. Is that appropriate?
> 

That will make Nagios only run host-checks when they're needed (i.e.,
when a service on the host changes from OK to any other state).
It's definitely worth trying.

It could also be worth setting check_for_updates=0 in your nagios.cfg.
The update check is a high-priority event which blocks checks while
it's running. It shouldn't matter much, since it only runs at a
22-hour interval, but every little bit helps, I guess.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
