Max concurrent checks - spreading the next_time

Andreas Ericsson ae at op5.se
Wed Jun 10 10:52:35 CEST 2009


Ton Voon wrote:
> Hi!
> 
> We've seen situations where this appears in the nagios.log:
> 
> Max concurrent service checks (50) has been reached. Delaying further checks until previous checks are complete...
> 
> When switching on debugging, what we noticed is that services are 
> invoked all around the same time. I guess this happens when you have 
> selected a host and say "force check all services on this host".
> 
> What happens is that in the event code (base/events.c), it seems that if 
> this max_concurrent_checks is reached, then the service is ignored and 
> is rescheduled with a next check time based on the next regular check 
> interval. But if you do that, then all the other services will still be 
> invoked around the same time.
> 
> 
>   /* reschedule the check if we can't run it now */
>   if(run_event==FALSE){
>     /* remove the service check from the event queue and reschedule it for a later time */
>     /* 12/20/05 since event was not executed, it needs to be remove()'ed to maintain sync with event broker modules */
>     temp_event=event_list_low;
>     remove_event(temp_event,&event_list_low,&event_list_low_tail);
>     if(temp_service->state_type==SOFT_STATE && temp_service->current_state!=STATE_OK)
>       temp_service->next_check=(time_t)(temp_service->next_check+(temp_service->retry_interval*interval_length));
>     else
>       temp_service->next_check=(time_t)(temp_service->next_check+(temp_service->check_interval*interval_length));
>     temp_event->run_time=temp_service->next_check;
>     reschedule_event(temp_event,&event_list_low,&event_list_low_tail);
>     update_service_status(temp_service,FALSE);
>     run_event=FALSE;
>   }
> 
> I propose that instead of setting next_time = next_time + 
> check_interval, a random factor is added, maybe something like:
> 
> next_time = now + max(5, min(int(rand(15)), int(rand(retry_interval*interval_length))))
> 
> This means that the next check has been moved at least 5 seconds away 
> from now (to overcome the temporary load due to the number of concurrent 
> service checks), with a maximum of 15 seconds away (or less if the 
> retry_interval is lower).
> 
> Thoughts?
> 
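
For reference, the proposed formula could be sketched in C roughly as
follows. This is only an illustration: jittered_next_check() is a
hypothetical helper, not an existing Nagios function, and it assumes
retry_interval is a double, interval_length is in seconds, and that
rand(n) in the pseudocode means "a random number in [0, n)".

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical helper (not part of Nagios): pick a new check time
 * between 5 and 14 seconds from now, following
 *   now + max(5, min(rand(15), rand(retry_interval*interval_length)))
 */
static time_t jittered_next_check(time_t now, double retry_interval,
                                  int interval_length)
{
    assert(interval_length > 0);

    /* upper bound of the second random draw, in seconds */
    int cap = (int)(retry_interval * interval_length);

    int a = rand() % 15;                  /* 0..14 */
    int b = cap > 0 ? rand() % cap : 0;   /* 0..cap-1, or 0 if cap is 0 */

    int delay = a < b ? a : b;            /* min(a, b) */
    if (delay < 5)
        delay = 5;                        /* max(5, delay) */

    return now + delay;
}
```

Note that because of the max(5, ...) clamp, every draw below 5 lands on
exactly now + 5, so a fair share of the rescheduled checks would still
coincide on that second.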

I can't help but think that something like this could have been quite
easily resolved with a round-robin scheduling queue, where items requested
to be queued would simply get inserted within 5 seconds of the requested
time where there are the most free slots. The PRNG idea will probably
work just as well though, and I'm fairly certain you could just use

  next_time = service->check_interval - 7 + (*service->description & 0xf);

to get a distribution almost equally good without having to bother
with the PRNG business. This would yield an offset of roughly +-7
seconds (-7 to +8, to be exact, since the mask gives 0..15), which is
probably good enough.
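
Spelled out a bit more, the deterministic variant might look like the
sketch below. spread_next_check() is a hypothetical helper, not an
existing Nagios function; adding "now" and multiplying by
interval_length are assumptions filled in here, since the one-liner
above leaves them out.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper (not part of Nagios): regular interval plus a
 * fixed per-service offset taken from the low four bits of the first
 * character of the service description.
 */
static time_t spread_next_check(time_t now, double check_interval,
                                int interval_length,
                                const char *description)
{
    assert(description != NULL);

    /* low nibble is 0..15, so the offset is -7..+8 seconds */
    int offset = (description[0] & 0xf) - 7;

    return now + (time_t)(check_interval * interval_length) + offset;
}
```

With now = 1000, check_interval = 5 and interval_length = 60, a service
named "PING" would land on 1293 and "HTTP" on 1301. Services whose
descriptions start with the same character get the same offset, so the
spread depends on how varied the names are.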

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
