massive service check latencies

Andreas Ericsson ae at op5.se
Wed Mar 23 17:45:30 CET 2005


Ben wrote:
> I've been having a horrible time with service check latencies. I've got
> ~6k services so I thought at first maybe my hardware couldn't keep up.  
> But after moving to much beefier hardware, things have actually gotten
> worse, not better. So I figured, I'd been running a recent beta...
> maybe one of the new checkins fixed something. I tried to pull down the
> latest from CVS this morning, and it has the same situation.
> 

I assume you're running the very latest of the 2.x branch then.

> So now I think I just have a basic misunderstanding of the way nagios
> schedules checks. Here's how I've tweaked my settings to try to make 
> things run more frequently:
> 
> service_inter_check_delay_method=n

From the nagios docs, regarding service_inter_check_delay_method:
n = Don't use any delay - schedule all service checks to run immediately 
(i.e. at the same time!)

Perhaps this would be better off as 0.3 or s (s meaning nagios calculates 
a suitable delay between checks on its own).

> max_service_check_spread=60

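With this statement you're telling nagios to spread the initial checks of 
all services over at most an hour (the value is in minutes). The docs also 
say that this adjusts service_inter_check_delay "if necessary", i.e. only 
when the delay would otherwise push the initial checks past that window.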
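
To put rough numbers on how the spread and the delay interact, here is a 
small shell sketch. It assumes the "smart" delay is the average check 
interval divided by the number of services, and that the spread only caps 
that delay so the first pass fits inside the window; that is one reading 
of the docs, not something verified against the nagios source. The 
function name smart_delay and the 5-minute check interval are made up for 
the example; the 6000 services and the 60-minute spread are Ben's.

```shell
#!/bin/sh
# smart_delay AVG_INTERVAL_SECONDS SERVICE_COUNT MAX_SPREAD_MINUTES
# Hypothetical helper: prints the assumed inter-check delay in seconds.
smart_delay() {
  awk -v i="$1" -v n="$2" -v s="$3" 'BEGIN {
    d = i / n           # assumed "smart" delay: average interval / services
    cap = s * 60 / n    # largest delay that still fits inside the spread
    if (d > cap) d = cap
    printf "%.3f\n", d
  }'
}

# 6000 services, assumed 300 s average check interval, 60 minute spread
smart_delay 300 6000 60   # prints 0.050
```

Under these assumptions the cap (0.6 seconds here) never applies; it would 
only matter if the average check interval were longer than the spread 
itself.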

> service_interleave_factor=s

Seems correct.

> host_inter_check_delay_method=n
> max_host_check_spread=60

Either you've overconfigured your nagios, or you have enabled scheduled 
hostchecks without reading the docs about it. Host checks are executed 
in serial (one at a time), so you'll see some serious service check 
latencies if you have them enabled.

> max_concurrent_checks=0
> service_reaper_frequency=5
> 

This seems right, but if load isn't high you should set 
service_reaper_frequency lower. Try 2 or something.
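
Taken together, the suggestions so far would look roughly like this in 
nagios.cfg. This is a sketch, not a tuned configuration: the value 2 is 
just the one floated above, and the lines marked unchanged are copied from 
Ben's settings.

```
# nagios.cfg fragment (sketch; starting values, not tuned for any site)

# was n: let nagios calculate the delay between service checks
service_inter_check_delay_method=s

# unchanged
service_interleave_factor=s

# unchanged (0 means no limit on parallel checks)
max_concurrent_checks=0

# was 5
service_reaper_frequency=2
```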

> What I notice is that checks are queued up several dozen at a time, and
> that they all have to finish before the next batch can begin.

Not true. Service checks are scheduled and run on-demand. Scheduled 
hostchecks fuck up the service check scheduling.

> As far as I
> can tell, there is no way to make the size of the batch grow, or to stop
> waiting for all checks to finish before moving on. The hardware (dual 2.8
> xeon with 2.5GB of ram dedicated to monitoring) is not at all stressed.
> 
> 
> Interestingly, while my service check latencies average around 500 
> seconds, my host check latencies are well under 1 second, which is what I 
> would expect. FWIW, I've got about 2300 hosts.
> 
> Oh, and the average execution time for both service and host checks is 
> about 3 seconds.
> 

With perl checks you can most likely cut that to 20% with this simple 
sed line;
sed -i -e 's|^\(#!.*/bin/perl\).*|\1|' -e 's/use strict;/# &/'
Pass the plugin files as arguments. sed 4.0.9 or higher is required (for 
the -i switch). In effect, it comments out the strict pragma and removes 
all switches (such as -wT) from the perl shebang line.
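
To see what that substitution does before letting it loose on real 
plugins, here is a sketch that runs it without -i, so nothing is edited in 
place. The function name strip_perl_overhead and the three-line sample 
plugin are invented for the demonstration. It uses | as the delimiter of 
the first expression so the slashes in /bin/perl need no escaping, and a 
plain & in the second so the matched text is kept behind the comment sign.

```shell
#!/bin/sh
# Hypothetical helper: reads perl source from stdin (or from the files
# given as arguments) and writes the stripped version to stdout.
strip_perl_overhead() {
  sed -e 's|^\(#!.*/bin/perl\).*|\1|' \
      -e 's/use strict;/# &/' "$@"
}

# Made-up sample plugin, fed in on stdin
printf '%s\n' '#!/usr/bin/perl -wT' 'use strict;' 'print "OK\n";' \
  | strip_perl_overhead
# prints:
#   #!/usr/bin/perl
#   # use strict;
#   print "OK\n";
```

Keep in mind that -w, -T and strict are perl's own safety checks, so this 
trades them for speed; it is worth confirming that a plugin still behaves 
before dropping them.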

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer

