Max concurrent checks - spreading the next_time

Ton Voon ton.voon at opsera.com
Tue Jun 9 23:31:59 CEST 2009


Hi!

We've seen situations where this appears in the nagios.log:

Max concurrent service checks (50) has been reached. Delaying further checks until previous checks are complete...

With debugging switched on, we noticed that the services are all  
invoked at around the same time. I guess this happens when you select  
a host and choose "force check all services on this host".

In the event code (base/events.c), it seems that when  
max_concurrent_checks is reached, the service check is skipped and  
rescheduled with a next check time one regular check interval later  
(or one retry interval later for a service in a soft non-OK state).  
But if you do that, all the deferred services will still be invoked  
at around the same time as each other.


   /* reschedule the check if we can't run it now */
   if(run_event==FALSE){
     /* remove the service check from the event queue and reschedule it for a later time */
     /* 12/20/05 since event was not executed, it needs to be remove()'ed to maintain sync with event broker modules */
     temp_event=event_list_low;
     remove_event(temp_event,&event_list_low,&event_list_low_tail);
     if(temp_service->state_type==SOFT_STATE && temp_service->current_state!=STATE_OK)
       temp_service->next_check=(time_t)(temp_service->next_check+(temp_service->retry_interval*interval_length));
     else
       temp_service->next_check=(time_t)(temp_service->next_check+(temp_service->check_interval*interval_length));
     temp_event->run_time=temp_service->next_check;
     reschedule_event(temp_event,&event_list_low,&event_list_low_tail);
     update_service_status(temp_service,FALSE);
     run_event=FALSE;
   }
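
To illustrate why the checks stay bunched together, here is a minimal sketch (not Nagios code). It assumes the deferred services all share the same interval; the helper names window_width and defer_all are made up for this example. Shifting every check by the same amount leaves the width of the window they fall into unchanged:

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: width in seconds of the window that covers all
   scheduled check times. Expects n >= 1. */
static long window_width(const time_t *t, size_t n) {
    time_t lo = t[0], hi = t[0];
    for (size_t i = 1; i < n; i++) {
        if (t[i] < lo) lo = t[i];
        if (t[i] > hi) hi = t[i];
    }
    return (long)(hi - lo);
}

/* Hypothetical helper: mimics the deferral above, where every skipped
   check is moved forward by the same interval (in seconds). */
static void defer_all(time_t *t, size_t n, long interval_secs) {
    for (size_t i = 0; i < n; i++)
        t[i] += interval_secs;
}
```

Calling defer_all on a set of checks scheduled within a few seconds of each other gives the same window_width afterwards, so the same burst would hit the limit again one interval later.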

I propose that instead of setting next_check = next_check +  
check_interval, a random factor is added, maybe something like:

next_check = now + max(5, min(int(rand(15)), int(rand(retry_interval*interval_length))))

This means that the next check is moved at least 5 seconds away from  
now (to ride out the temporary load due to the number of concurrent  
service checks), and at most 15 seconds away (or less if the  
retry_interval is lower).
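
A minimal C sketch of the formula above, for discussion only. The function name spread_delay and its parameter retry_secs (standing for retry_interval*interval_length) are hypothetical and not part of Nagios, and rand() % n is assumed as a stand-in for int(rand(n)):

```c
#include <stdlib.h>

/* Hypothetical helper, not part of Nagios: returns a delay in seconds
   computed as max(5, min(rand(15), rand(retry_secs))). */
static int spread_delay(int retry_secs) {
    int a = rand() % 15;                                 /* 0..14 */
    int b = (retry_secs > 0) ? rand() % retry_secs : 0;  /* 0..retry_secs-1 */
    int d = (a < b) ? a : b;                             /* min of the two draws */
    return (d < 5) ? 5 : d;                              /* never sooner than 5s */
}
```

In the rescheduling branch this would presumably be used as something like temp_service->next_check = time(NULL) + spread_delay(...), with srand() called once at startup. One thing the sketch makes visible: because the smaller of the two draws is taken and then raised to 5, a sizable share of the results come out as exactly 5 seconds, so the spread is less even than a plain random value between 5 and 15 would give.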

Thoughts?

Ton

_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel

