Check becomes unplanned

Andreas Ericsson ae at op5.se
Tue Sep 16 08:40:12 CEST 2008


Bernd Arnold wrote:
> Thanks for all the comments. Yesterday I did some further 
> testing and debugging. And it was very interesting.
> 
> Well, forget the ntp story for a moment. The real problem is 
> not the time shift, although it seems it is. 
> The problem is the if-clause:
> 
> /* the service could not be rescheduled properly - set the next check time for next year, but don't actually reschedule it */
> if(time_is_valid==FALSE && next_valid_time==preferred_time){
> 
> If a service should be checked and the current check time is outside the 
> timeperiod, the service becomes unplanned / planned-in-one-year.
> 
> This could happen when the system time is changed backwards, either by ntp
> or manually. But it could also happen when the timeperiod is changed, Nagios config 
> is reloaded and the next check time value is outside this timeperiod.
> 
> Test 1:
> Next service check at 08:24
> Timeperiod changed to 08:43-18:00 (end time isn't important)
> Reload Nagios config
> At 08:24 the service is rescheduled to 08:43
> This is okay.
> 
> Test 2:
> Next service check at 08:39
> Timeperiod changed to 08:40-18:00
> Reload Nagios config
> At 08:39 the service is set to next check in one year and should_be_scheduled=FALSE
> This isn't okay.
> 
> So where is the difference?
> The success of the reschedule depends on the time difference: (timeperiod_begin - next_check).
> 
> In the first test, the timeperiod begins in 19 minutes. 
> This is more than five minutes. In the second test,
> there is only one minute left to the begin of the
> timeperiod.
> 
> Let's have a look into checks.c (about line 280) again:
> 
> if(current_time>=preferred_time)
>   preferred_time=current_time+((svc->check_interval<=0)?300:(svc->check_interval*interval_length));
> 
> The preferred_time is set to 5 minutes in the future.
> This is 08:29 (test 1) or 08:44 (test 2).
> Next code line:
> 
> get_next_valid_time(preferred_time,&next_valid_time,svc->check_period_ptr);
> 
> The next_valid_time is set to the beginning of the timeperiod.
> Test 1: this is 08:43, since preferred time is outside timeperiod.
> Test 2: this is 08:44, since the preferred time (now + 5min) is accepted.
> 
> When the next if-check is true, the next check time is set to +1 year.
> 
> if(time_is_valid==FALSE && next_valid_time==preferred_time){
> 
> This is only true for test 2 - our preferred time is the next_valid_time 
> set by the get_next_valid_time(...) function.
> 
> I've added a patch file. The if-statement was removed and the next check 
> time is always set to the preferred time.
> 
> Now both tests work fine and is a) ntp time changes aware and b) timeperiod changes aware.
> 
> Test 1:
> Next check at 11:05
> Timeperiod changed beginning 11:07
> Reload nagios config
> At 11:05 the next check is set to 11:10 (current time + five minutes)
> 
> Test 2:
> Next check at 11:10
> Timeperiod changed beginning 11:37
> Reload nagios config
> At 11:10 the next check is set to 11:37
> 
> I don't know if my patch has a negative influence to other things.

The only thing I can imagine is a billion checks getting scheduled at
the same time, but judging from your test output that doesn't seem to
be the case. Either way, that would be preferrable to having the check
switch itself off.

> Maybe someone could check this? Who knows the background of the if-clause?
> 

The logic goes all the way back to Tue Feb 26 04:03:25 2002, a commit named
"Initial revision" ("git log -S" is a wonderful tool :)), so I'm guessing it's
a sort of "oh shit" thing that just happened to stay alive because it hardly
ever triggered, or when it triggered it was just shrugged off.

I don't have time to dig into the sideeffects of the surrounding code back
in that time, but it seems your patch should be applied not only to services
but to hosts too (and would have been better than the original logic anyway).

> The patch covers only service checks so far! The same change could be applied 
> to host checks (line 2795 in checks.c) if my patch doesn't have a bad impact.
> 

Indeed. I've applied it to my patch queue (with changes to host scheduling logic
too) and will hand it over to Ethan next chance I get.

Very nice work both of you. Thanks.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/




More information about the Developers mailing list