Nagios retries checks too soon.

Jochen Bern Jochen.Bern at LINworks.de
Fri Jun 10 21:15:28 CEST 2011


On 06/10/2011 07:48 PM, Paul M Dubuc wrote:
> Jochen Bern wrote:
>> IIRC, the actual
>> code adds check_interval/retry_interval to the variable that holds the
>> (previous) scheduled check time - i.e., the time when the previous check
>> supposedly was *started* (assuming negligible check latency).
> 
> I was under the impression that the retry interval
> was only counted from the time the previous check completes and the
> status (which is needed to determine if a retry is necessary) is known.
>  Why is the retry time determined before it's known that one is needed?

Hmmmmmm. It seems that I misremembered ... partially.

> # egrep -n 'current_time.*(check|retry)_interval' nagios-3.2.3/base/checks.c
> 276:                            preferred_time=current_time+((svc->check_interval<=0)?300:(svc->check_interval*interval_length));
> 1825:                   preferred_time=current_time+check_interval;
> 1843:                   preferred_time=current_time+check_interval;
> 2814:                           preferred_time=current_time+((hst->check_interval<=0)?300:(hst->check_interval*interval_length));
> 3446:   next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3482:                   next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3555:                                   next_check=(unsigned long)(current_time+(hst->retry_interval*interval_length));
> 3559:                                   next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3585:                           next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3603:                           next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3705:                                   next_check=(unsigned long)(current_time+(hst->retry_interval*interval_length));
> 3709:                                   next_check=(unsigned long)(current_time+(hst->check_interval*interval_length));
> 3879:                   preferred_time=current_time+check_interval;
> 3893:                   preferred_time=current_time+check_interval;


> # egrep -n 'last_check.*(check|retry)_interval' nagios-3.2.3/base/checks.c
> 1304:                   next_service_check=(time_t)(temp_service->last_check+(temp_service->check_interval*interval_length));
> 1450:                                   next_service_check=(time_t)(temp_service->last_check+(temp_service->check_interval*interval_length));
> 1478:                                   next_service_check=(time_t)(temp_service->last_check+(temp_service->retry_interval*interval_length));
> 1545:                           next_service_check=(time_t)(temp_service->last_check+(temp_service->check_interval*interval_length));

Lemme have a closer look at the latter matches ...

They cover handle_async_service_check_result(). (Since there also is a
handle_async_host_check_result_3x() *elsewhere*, we clearly have
different behaviour between host and service checks.)

1304 is the catchall for STATE_OK results.
1450 is the special case for SOFT non-OK services on non-UP hosts.
1478 is its counterpart for UP hosts.
1545 covers HARD non-OK services.

Verification (looking at the *other* matches) ...

2814 through 3893 deal with *host* checks, 276 with *synchronous*
service checks (why is there no retry_interval??), 1825 and 1843 only
check viability, not results.

All in all, I'd say that async service checks, and *only* those, behave
the way I described: their next check is computed from last_check rather
than from current_time. Not sure whether there is a *reason* for that
difference ... anyone?

Kind regards,
								J. Bern
-- 
Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel
