Max concurrent checks - spreading the next_time

Hiren Patel hir3npatel at gmail.com
Sat Jun 13 11:29:40 CEST 2009


Ton Voon wrote:
> This is the test case:
>    * set max_concurrent_checks=1 in nagios.cfg
>    * create a host with 3 services with a check_interval of 1 minute
>    * restart nagios
>    * go to the host page and schedule a check for all services on the  
> host (this makes all the services run at the same time)
>    * tail nagios.log. Should see "Max concurrent service checks (1)  
> has been reached"
>    * on the host page, notice the last run time. Only one will be  
> updated after 1 minute. All services get scheduled for the next time  
> at the same time, and after the next minute, only one of those will  
> have the last check time changed
>
Yes, that is exactly the behavior I see. I set up a standalone machine 
running the default checks against itself, and the queue shows them all 
scheduled for the same time in the next minute. The log entries also 
appear as you describe.
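
To make the pile-up concrete, here is a toy model of the behavior described above. It is only a sketch and not the Nagios scheduler: the function name, the service names, and the rule that the first service in list order wins the slot are all assumptions made for illustration.

```python
def simulate_pileup(services, max_concurrent, interval, rounds):
    """Toy model: all services come due at the same instant each round."""
    next_check = {name: 0 for name in services}
    last_check = {name: None for name in services}
    for _ in range(rounds):
        now = min(next_check.values())
        running = 0
        for name in services:
            if next_check[name] != now:
                continue
            if running < max_concurrent:
                # A slot is free, so this check actually runs.
                running += 1
                last_check[name] = now
            # Assumption: run or skipped, the service is rescheduled one
            # full interval later, so everything stays on the same second.
            next_check[name] = now + interval
    return last_check, next_check


last, upcoming = simulate_pileup(["ping", "load", "users"], 1, 60, 3)
print(last)
print(upcoming)
```

In this model, with a limit of 1, only "ping" ever gets a last check time, while all three services stay queued for the same second.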

> I've just committed a patch into CVS HEAD. This nudges the time ahead  
> by 5 + random(10) seconds. I've also included a test case which  
> ensures that the nudge factor is added in these cases.
> 
> nagios.log will also have an entry which lists the affected service.  
> If you get this message a lot on a regular system, then you need to  
> consider increasing the max_concurrent_checks value.
> 
> I'd be grateful if you could try this out.
>
With the patch, the checks are now spread out in the queue, and all the 
services get checked sooner than they did without the patch, at least as 
far as I could tell. There is one odd behavior: with the default checks 
running, one check kept getting nudged and as a result wasn't run for a 
while. Attached is the nagios.log; the first two restarts are without the 
patch, and after that it is running with the patch. For the entire time I 
ran with the patch, the "current users" check was never run. Am I doing 
something wrong in testing this, though?
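
For reference, the nudge described in the quoted text, "5 + random(10) seconds", could be sketched as below. This is not the patch itself: the function name is hypothetical, and reading random(10) as an integer from 0 to 9 is an assumption.

```python
import random


def nudged_time(now, rng=random):
    """Return a new run time, nudged ahead of `now` by 5 + random(10) seconds.

    Assumption: random(10) yields an integer in 0..9, so the nudge is 5..14 s.
    """
    return now + 5 + rng.randrange(10)


# Three services that all came due at t=600 and were refused a slot.
rng = random.Random(42)  # seeded only so the sketch is repeatable
for name in ("ping", "load", "users"):
    print(name, nudged_time(600, rng))
```

Because each refused check draws its own offset, the checks no longer all land on the same second, though two of them can still pick the same offset and collide again.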

> Thinking some more, setting the next check time ahead doesn't really  
> make sense, because the latency value does not reflect the fact that  
> this active service's check time was delayed. Maybe this should be  
> implemented as a remove of the event from the queue, and then re-added  
> with a nudged event run time but the old service->next_check time.
> 
> Anyhow, this should be better than it was.
Agreed about the latency, although the incident is logged, so users 
should be able to work out why their checks are running a little late. 
I'm not sure about the event queue and how it works yet; I haven't looked 
at that part of Nagios.
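
The latency point can be illustrated with a small sketch, assuming latency is measured as the actual run time minus the service's next_check time. The class and function names below are hypothetical and do not come from the Nagios source.

```python
from dataclasses import dataclass


@dataclass
class ScheduledCheck:
    next_check: int  # when the service was meant to run
    run_time: int    # when the queued event will actually fire


def latency(check):
    # Assumed definition: delay between intended and actual run time.
    return check.run_time - check.next_check


def nudge_next_check(check, nudge):
    # What the quoted text says the patch does: move next_check ahead too.
    return ScheduledCheck(check.next_check + nudge, check.run_time + nudge)


def nudge_event_only(check, nudge):
    # The suggested alternative: re-add the event later, keep next_check.
    return ScheduledCheck(check.next_check, check.run_time + nudge)


original = ScheduledCheck(next_check=600, run_time=600)
print(latency(nudge_next_check(original, 8)))  # delay is hidden
print(latency(nudge_event_only(original, 8)))  # delay shows up as latency
```

Under that assumption, moving next_check hides the delay (latency stays 0), while re-queueing only the event reports the 8 second nudge as latency.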
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nagios.log
Type: text/x-log
Size: 18322 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20090613/0620cf3b/attachment.bin>