massive service check latencies

Ben bench at silentmedia.com
Wed Mar 23 20:50:31 CET 2005


I tracked this down, and now I have to appeal to those of you with better 
Nagios knowledge than I about the best way to fix it. Here's the deal....

1. Nagios schedules a bunch of checks. Those checks start to kick off and
run in parallel. This is great.

2. All these checks start returning results. Ok....

3. When the reaper kicks off, it is a high priority event, and has to
process all outstanding check results before any more checks (which are
low priority) can be run. I don't know if it's nagios-db that's slowing me
down, but reaping 4 dozen check results is taking anywhere from 4 to 10
seconds. For that duration, no more checks can be scheduled.
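
To put rough numbers on the stall described in step 3, here is a toy model
of a single-threaded event loop in which a high priority reaper blocks low
priority check dispatch. It is only a sketch, not Nagios code: the function
name, the 10 ms dispatch cost, and the rule that the next reap is due a
fixed interval after the previous one finishes are all assumptions made for
illustration. Only the 5 second reaper frequency and the 4 to 10 second
reap times come from the observations in this thread.

```python
def simulate(reap_ms, reap_every_ms=5000, dispatch_ms=10, horizon_ms=60000):
    """Count checks dispatched by a toy single-threaded scheduler.

    All times are integer milliseconds. The reaper is modeled as a
    high priority event that blocks the loop for reap_ms each time it
    runs (hypothetical model, not the real Nagios scheduler).
    """
    now = 0
    next_reap = reap_every_ms
    dispatched = 0
    while now < horizon_ms:
        if now >= next_reap:
            # High priority: nothing else runs until reaping is done.
            now += reap_ms
            next_reap = now + reap_every_ms
        else:
            # Low priority: dispatch one check, costing dispatch_ms.
            now += dispatch_ms
            dispatched += 1
    return dispatched


if __name__ == "__main__":
    for reap_ms in (0, 4000, 7000, 10000):
        print(reap_ms, simulate(reap_ms))
```

Under these assumptions, the longer each reap takes, the smaller the share
of wall-clock time left for dispatching checks, which is consistent with
latency growing even though the hardware is idle.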


So now I have to ask, what's the proper way to fix this? It would be 
pretty trivial for me to run the reaper in a second thread, or to put the 
work involved in reaping an individual check result into a thread pool, 
but I don't know what assumptions that would be violating. 
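
As a rough illustration of the thread pool idea, and of the kind of
assumption it could violate, here is a minimal sketch in Python rather than
Nagios's C. Every name in it (ServiceState, apply, submit_reaping) is
hypothetical. The point is only that once results are processed off the
main thread, any state the main loop also touches needs explicit locking,
which single-threaded code may be assuming it never needs.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ServiceState:
    """Hypothetical stand-in for shared service status data."""

    def __init__(self):
        self._lock = threading.Lock()
        self.results = {}

    def apply(self, service, status):
        # Guard shared state: worker threads and the scheduler
        # could otherwise update it at the same time.
        with self._lock:
            self.results[service] = status


def submit_reaping(pool, pending, state):
    """Hand each check result to the pool and return immediately."""
    return [pool.submit(state.apply, svc, status) for svc, status in pending]


if __name__ == "__main__":
    state = ServiceState()
    pending = [("svc%d" % i, i % 3) for i in range(48)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = submit_reaping(pool, pending, state)
        # A scheduler loop would be free to dispatch more checks here.
        for f in futures:
            f.result()  # re-raises any exception from a worker
    print(len(state.results))
```

Whether the real result-processing code can be made safe this way depends
on what else it touches (logging, notifications, event broker callbacks),
which is exactly the open question.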

On Mon, 21 Mar 2005, Ben wrote:

> I've been having a horrible time with service check latencies. I've got
> ~6k services so I thought at first maybe my hardware couldn't keep up.  
> But after moving to much beefier hardware, things have actually gotten
> worse, not better. So I figured, I'd been running a recent beta...
> maybe one of the new checkins fixed something. I tried to pull down the
> latest from CVS this morning, and it has the same situation.
> 
> So now I think I just have a basic misunderstanding of the way nagios
> schedules checks. Here's how I've tweaked my settings to try to make 
> things run more frequently:
> 
> service_inter_check_delay_method=n
> max_service_check_spread=60
> service_interleave_factor=s
> host_inter_check_delay_method=n
> max_host_check_spread=60
> max_concurrent_checks=0
> service_reaper_frequency=5
> 
> What I notice is that checks are queued up several dozen at a time, and
> that they all have to finish before the next batch can begin. As far as I
> can tell, there is no way to make the size of the batch grow, or to stop
> waiting for all checks to finish before moving on. The hardware (dual 2.8
> xeon with 2.5GB of ram dedicated to monitoring) is not at all stressed.
> 
> 
> Interestingly, while my service check latencies average around 500 
> seconds, my host check latencies are well under 1 second, which is what I 
> would expect. FWIW, I've got about 2300 hosts.
> 
> Oh, and the average execution time for both service and host checks is 
> about 3 seconds.
> 
> 







