Was: Nagios arch to improve performance Re: Re: Nagios-devel digest, Vol 1 #807 - 8 msgs

Andreas Ericsson ae at op5.se
Mon May 23 17:21:07 CEST 2005


Ben wrote:
>>
>> That said, the current bottleneck in Nagios appears to be the fact
>> that it runs checks in chunks rather than as standalone units which
>> can be picked up as they become eligible for checking. If that
>> little snag could be overcome, I'm confident that the aforementioned
>> average check latency of 25 seconds could be done away with.
>>
> 
> 
> This is misleading. In my experience, Nagios doesn't run checks in
> chunks. It *does* kick off as many concurrent checks as you tell it
> (assuming there are things that need to be checked), but, if the
> results come in while it's still trying to kick off more checks, it
> stops doing that so it can process the new results. Because similar
> checks tend to be started at similar times and take similarly long to
> run, that means that it *appears* as if Nagios kicks off a batch of
> checks, then waits a while, then kicks off some more. In actuality,
> it's processing the results of the first batch before it does
> anything else, and the batch size is defined by how long it takes
> from the first check being started until the first result comes in.
> 
> One possible way to speed this up is to trade in the rather simple  
> current model of "we can't initiate checks if we've got pending  
> results, because those results might alter what we need to check" for  
> the much more complex (but scalable and possibly more correct) model
> of "we can't send more checks that depend on the results of what we  
> currently have outstanding checks for, but if we want to check  
> unrelated services, not a problem."
> 
> It seems to me that would help an awful lot, assuming it was
> bug-free, but it's also a pretty fundamental change to Nagios'
> scheduler.
> 

Not only the scheduler: to be implemented efficiently, it also requires
a fairly fundamental change in how Nagios structures its memory (i.e.
checks depending on other checks must be linked to those checks).
Otherwise Nagios will just spend its time in hashfunc1() and hashfunc2()
instead, looking for services that may or may not be dependent on, or
depended on by, the eligible check.
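
As a rough sketch of the kind of linking meant here, assuming a
simplified check record: the struct, its fields, MAX_DEPS and both
function names below are hypothetical illustrations, not Nagios's
actual data structures. Each check holds direct pointers to the checks
it depends on, so deciding whether it may be started needs no hash
lookups.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified check record (not Nagios's real service struct). */
#define MAX_DEPS 4

struct check {
    const char *name;
    bool result_pending;                /* started, result not yet processed */
    struct check *depends_on[MAX_DEPS]; /* direct links to the checks this one depends on */
    size_t n_deps;
};

/* Link child to parent. Returns false if the fixed-size link array is full. */
static bool check_add_dependency(struct check *child, struct check *parent)
{
    if (child->n_deps >= MAX_DEPS)
        return false;
    child->depends_on[child->n_deps++] = parent;
    return true;
}

/*
 * A check may be started if it is not itself outstanding and none of the
 * checks it depends on still has an outstanding result. Only the linked
 * pointers are followed; no name-based lookup is involved.
 */
static bool check_can_start(const struct check *c)
{
    if (c->result_pending)
        return false;
    for (size_t i = 0; i < c->n_deps; i++) {
        if (c->depends_on[i]->result_pending)
            return false;
    }
    return true;
}
```

With links like these, a check that depends on a still-outstanding
check is held back, while unrelated checks can be started right away.
The fixed-size array only keeps the sketch short; a real implementation
would more likely use a linked list or a dynamically sized array.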

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer

