Antwort: Re: Check becomes unplanned

Andreas Ericsson ae at op5.se
Sun Sep 14 11:59:03 CEST 2008


Sascha.Runschke at gfkl.com wrote:
> Hi Bernd,
> hi Andreas,
> 
>> To alleviate your issue, you should be running an ntp daemon
>> on the Nagios server which slews the clock into its right
>> time rather than sets it (slew = make it go slightly faster
>> or slower until it matches the correct time). Are you running
>> ntpdate via a cronjob or something?
>>
>> I'm not sure how one would go about debugging this, as the
>> time required to run a single test is prohibitive for rapid
>> repeated testing.
> 
> I already encountered that problem before and started debugging it,
> so I'll just share what I know so far. Sadly I haven't had the time
> yet to really pinpoint a solution and produce a patch.
> I'm not that big a fan of C ;)
> 
> How to produce it:
> 
> - define a check "freaky_check" with limited check_period, let's
>   call it 7to11 and a check_interval of 3
> - produce steady backwards time-shifts (Nagios running in a VM, anyone?)
> 
> What happens:
> 
> 1. it's 11pm, nagios schedules freaky_check for 7am according to its 
> check_period
> 2. Every X minutes timeshift -1 sec (jittering timesource)
> 3. nagios tries to compensate it and adjusts _all_ checks to the timeshift 
> (next_check = next_check - timeshift)
> 4. time goes by from 11pm to 6am, shifting time for - let's say - 8 
> minutes back
> 5. freaky_check is now scheduled for 6:52am because of the timeshifts
> 6. it's 6:52am and nagios tries to run the freaky_check according to the 
> schedule
> 7. sanity check says: ERROR: check outside check_period
> 8. nagios tries to compensate with a strange logic: next_check = 
> next_check + check_interval and just hopes it will fit
> 9. nagios reruns the sanity check: FATAL ERROR: check still outside 
> check_period - I have no clue what to do: rescheduling freaky_check: 
> next_check = next_check + 1year
> 10. user puzzled and nagios thinks it's all cool
> 
> Conclusion:
> 
> This behaviour turns up when the following criteria are met:
> 
> - check has a reduced check_period
> - time is shifting back
> - the timeshift outside the check_period is greater than 2 times the
>   check_interval
> 
> You can look it up in base/checks.c within the
> run_scheduled_service_check(service *svc, int check_options, double 
> latency)
> function for example. 
> 

Nice detective work there. It really pinpoints the place to start at.
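
To make the sequence above concrete, here is a small model of steps 3
and 7-9 as Sascha describes them. It is a sketch only, not the code in
base/checks.c: times are minutes since midnight, the period is the
"7to11" example, and in_period(), compensate() and reschedule() are
hypothetical names. The exact threshold at which the real code gives up
may differ from this model.

```c
/* Simplified model, NOT the Nagios source. All names are hypothetical.
 * Times are minutes since midnight on a single day. */
#define PERIOD_START (7 * 60)          /* 07:00, start of "7to11" */
#define PERIOD_END   (11 * 60)         /* 11:00, end of "7to11" */
#define ONE_YEAR_MIN (365L * 24 * 60)  /* one year, in minutes */

/* Sanity check: is t inside the check_period? */
int in_period(long t)
{
    return t >= PERIOD_START && t < PERIOD_END;
}

/* Step 3: a queued check is moved back by the detected time shift. */
long compensate(long next_check, long shift_back)
{
    return next_check - shift_back;
}

/* Steps 7-9: if the check is outside the period, retry once at
 * +check_interval; if that is still outside, push it out a year. */
long reschedule(long next_check, long check_interval)
{
    if (in_period(next_check))
        return next_check;
    next_check += check_interval;
    if (in_period(next_check))
        return next_check;
    return next_check + ONE_YEAR_MIN;
}
```

With a check_interval of 3 and 8 minutes of accumulated drift,
compensate(420, 8) gives 412 (6:52am), and reschedule(412, 3) lands a
year out, which is the "unplanned" check from the subject line.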

> 
> The question is now - what's the smartest way to handle this?
> Basically I see 2 different approaches:
> 
> 1. When compensating for timeshifts, double-check that you do not move a check 
> outside its valid check_period
> 2. When rescheduling a check that somehow ended up outside its 
> check_period, try to be smart and look for
> the next valid time inside the check_period of that check instead of just 
> adding check_interval and naively
> hoping for it to be all right
> 

I believe option 2 is the right one, i.e., schedule the check to run
at "next-timeperiod-start-time + $random_subminute_time", as timeranges
only have minute precision.
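
A rough sketch of that idea, under simplifying assumptions: times are
seconds since midnight, the timeperiod is a single range [start, end)
within one day, and next_valid_check() is a hypothetical helper, not an
existing Nagios function.

```c
#include <stdlib.h>

#define SECONDS_PER_DAY 86400L

/* Hypothetical helper, not a Nagios API. If 'preferred' falls inside
 * [period_start, period_end) it is kept; otherwise the check is moved
 * to the next start of the period plus a random offset of 0-59 seconds. */
long next_valid_check(long preferred, long period_start, long period_end)
{
    long base;

    if (preferred >= period_start && preferred < period_end)
        return preferred;

    base = period_start;
    if (preferred >= period_end)
        base += SECONDS_PER_DAY;   /* period is over for today: use tomorrow */

    return base + rand() % 60;     /* random sub-minute offset */
}
```

The sub-minute offset spreads out checks that would otherwise all fire
at the exact second the timeperiod opens.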

It could also be correct to drop the entire "clock has changed"
logic (with a log message when we suspect that it has) and just
accept the fact that if the system clock is going bananas, we can't
really hope for any checks to be executed at the right time anyway.
That would be the path of least surprise, I think, although less
than stellar in terms of keeping the check interval accurate.
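
In sketch form, that alternative would only detect and report the jump
and leave every next_check alone. detect_backward_jump() is a
hypothetical name and the message text is made up for illustration:

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical sketch, not Nagios code. Returns the size of a backward
 * clock jump in seconds (0 if the clock did not go backwards) and logs
 * it. Scheduled checks are deliberately left untouched. */
long detect_backward_jump(time_t last_seen, time_t now)
{
    long jump;

    if (now >= last_seen)
        return 0;

    jump = (long)(last_seen - now);
    fprintf(stderr,
            "Warning: system clock moved back %ld seconds; "
            "check schedule left unchanged\n", jump);
    return jump;
}
```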

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
