Reply: Re: Check becomes unplanned

Sascha.Runschke at gfkl.com Sascha.Runschke at gfkl.com
Wed Sep 10 10:59:24 CEST 2008


Hi Bernd,
hi Andreas,

> To alleviate your issue, you should be running an ntp daemon
> on the Nagios server which slews the clock into its right
> time rather than sets it (slew = make it go slightly faster
> or slower until it matches the correct time). Are you running
> ntpdate via a cronjob or something?
>
> I'm not sure how one would go about debugging this, as the
> time required to run a single test is prohibitive for rapid
> repeated testing.

I ran into this problem before and started debugging it, so I'll
share what I know so far. Sadly I haven't had the time yet to
pinpoint a fix and produce a patch.
I'm not that big a fan of C ;)

How to reproduce it:

- define a check "freaky_check" with a limited check_period (let's
  call it 7to11) and a check_interval of 3
- produce steady backward time shifts (Nagios running in a VM, anyone?)

What happens:

1. It's 11pm; Nagios schedules freaky_check for 7am according to its
   check_period.
2. Every X minutes the clock shifts back by 1 second (jittering time
   source).
3. Nagios tries to compensate and adjusts _all_ checks by the timeshift
   (next_check = next_check - timeshift).
4. Time goes by from 11pm to 6am, and the accumulated shifts add up to,
   let's say, 8 minutes.
5. freaky_check is now scheduled for 6:52am because of the timeshifts.
6. At 6:52am Nagios tries to run freaky_check according to the schedule.
7. The sanity check says: ERROR: check outside check_period.
8. Nagios tries to compensate with a strange logic:
   next_check = next_check + check_interval, and just hopes it will fit.
9. Nagios reruns the sanity check: FATAL ERROR: check still outside
   check_period, I have no clue what to do, rescheduling freaky_check:
   next_check = next_check + 1 year.
10. The user is puzzled and Nagios thinks everything is fine.
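
The drift in steps 1 to 5 can be replayed with a few lines of C. This is only a sketch under assumptions: the window is taken to be 07:00 to 23:00 daily, times are plain seconds counted from an arbitrary midnight, and all names (in_period, compensate, drift) are made up for the illustration and are not Nagios functions.

```c
#include <assert.h>

#define MINUTE 60L
#define HOUR   (60L * MINUTE)
#define DAY    (24L * HOUR)

/* Assumed daily window for the illustration: 07:00 (inclusive) to 23:00 (exclusive). */
#define PERIOD_START (7L * HOUR)
#define PERIOD_END   (23L * HOUR)

/* Returns 1 if t (seconds since an arbitrary midnight) lies inside the window. */
int in_period(long t)
{
    long time_of_day;

    assert(t >= 0);
    time_of_day = t % DAY;
    return time_of_day >= PERIOD_START && time_of_day < PERIOD_END;
}

/* Step 3: a detected backward shift is subtracted from next_check. */
long compensate(long next_check, long shift_seconds)
{
    return next_check - shift_seconds;
}

/* Steps 2 to 5: apply `shifts` backward jumps of one second each. */
long drift(long next_check, int shifts)
{
    int i;

    for (i = 0; i < shifts; i++)
        next_check = compensate(next_check, 1);
    return next_check;
}
```

Starting from a check planned for 07:00 on the next day (DAY + 7 * HOUR), 480 one-second shifts leave it at 06:52, where in_period() returns 0. Adding a check_interval of 3 minutes gives 06:55, which is still outside the window.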

Conclusion:

This behaviour turns up when the following criteria are met:

- check has a reduced check_period
- time is shifting back
- the timeshift outside the check_period is greater than 2 times the
  check_interval

You can look it up in base/checks.c, for example in the function

run_scheduled_service_check(service *svc, int check_options, double latency)

After some basic checks this will be run:

/* attempt to run the check */
result=run_async_service_check(svc,check_options,latency,TRUE,TRUE,&time_is_valid,&preferred_time);

which in turn ends up with:

/* is the service check viable at this time? */
if(check_service_check_viability(svc,check_options,time_is_valid,preferred_time)==ERROR)
   return ERROR;

No: since Nagios shifted the check outside its check_period, the time is
NOT valid.

Back in run_scheduled_service_check we now enter the if(result==ERROR)
branch:

/* get current time */
time(&current_time);

/* determine next time we should check the service if needed */
/* if service has no check interval, schedule it again for 5 minutes from now */
if(current_time>=preferred_time)
        preferred_time=current_time+((svc->check_interval<=0)?300:(svc->check_interval*interval_length));

COMMENT: Nagios set preferred_time to current_time plus the check_interval

/* make sure we rescheduled the next service check at a valid time */
get_next_valid_time(preferred_time,&next_valid_time,svc->check_period_ptr);

COMMENT: No, it didn't, as adding check_interval was not enough to
compensate for the backward shift in time

/* the service could not be rescheduled properly - set the next check time for next year, but don't actually reschedule it */
if(time_is_valid==FALSE && next_valid_time==preferred_time){

COMMENT: Nagios is bailing out here and just adds 1 year to
next_valid_time (which equals preferred_time here) to get the scheduler
running again

        svc->next_check=(time_t)(next_valid_time+(60*60*24*365));
        svc->should_be_scheduled=FALSE;

        log_debug_info(DEBUGL_CHECKS,1,"Unable to find any valid times to reschedule the next service check!\n");
        }

/* this service could be rescheduled... */
else{
        svc->next_check=next_valid_time;
        svc->should_be_scheduled=TRUE;

        log_debug_info(DEBUGL_CHECKS,1,"Rescheduled next service check for %s",ctime(&next_valid_time));
        }
}

COMMENT: BANG - our check just got shot off to Mars, landing in 1 year,
and we don't even get a notification for it, and it doesn't become
orphaned or anything...
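
The branch logic quoted above can be isolated into a small stand-alone model to see when a check gets parked a year ahead. This is a sketch, not the Nagios code: struct svc_model, next_valid_fn and the two lookup_* functions are hypothetical stand-ins (the real code uses the service struct and get_next_valid_time()), and interval_seconds stands for check_interval*interval_length.

```c
#include <assert.h>

#define ONE_YEAR (60L * 60L * 24L * 365L)

/* Hypothetical stand-in for the two service fields the quoted code touches. */
struct svc_model {
    long next_check;
    int  should_be_scheduled;
};

/* Hypothetical stand-in for the timeperiod lookup: maps a preferred time
 * to the time the lookup reports as the next valid one. */
typedef long (*next_valid_fn)(long preferred_time);

/* Lookup that hands the preferred time back unchanged. */
long lookup_unchanged(long preferred_time)
{
    return preferred_time;
}

/* Lookup that finds a slot one hour later. */
long lookup_plus_hour(long preferred_time)
{
    return preferred_time + 3600L;
}

/* Mirrors the quoted if/else after a failed viability check. */
struct svc_model reschedule_after_error(long current_time, long preferred_time,
                                        long interval_seconds, int time_is_valid,
                                        next_valid_fn lookup)
{
    struct svc_model svc;
    long next_valid_time;

    assert(lookup != 0);

    /* no interval configured: try again 5 minutes from now */
    if (current_time >= preferred_time)
        preferred_time = current_time + (interval_seconds <= 0 ? 300L : interval_seconds);

    next_valid_time = lookup(preferred_time);

    if (!time_is_valid && next_valid_time == preferred_time) {
        /* the bail-out: park the check one year ahead, unscheduled */
        svc.next_check = next_valid_time + ONE_YEAR;
        svc.should_be_scheduled = 0;
    } else {
        svc.next_check = next_valid_time;
        svc.should_be_scheduled = 1;
    }
    return svc;
}
```

In this model the one-year push happens exactly when the current time was invalid and the lookup returns the preferred time unchanged. As soon as the lookup returns any other time, the check is rescheduled normally.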


The question now is: what's the smartest way to handle this?
Basically I see 2 different approaches:

1. When compensating for timeshifts, double-check that you do not move a
   check outside its valid check_period.
2. When rescheduling a check that somehow ran outside its check_period,
   try to be smart and look for the next valid time inside that check's
   check_period instead of just adding check_interval and naively hoping
   it will be all right.
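
To make the two approaches concrete, here is a rough sketch of both for the simplest possible case, a single daily window. Everything in it is an assumption for illustration: struct daily_window, WINDOW_7_TO_23 and the function names are invented, times are seconds counted from an arbitrary midnight, and real Nagios timeperiods (weekdays, exceptions, multiple ranges) are far richer than this.

```c
#include <assert.h>

#define DAY_SECONDS 86400L

/* Hypothetical daily timeperiod: [start, end) in seconds after midnight. */
struct daily_window {
    long start;
    long end;
};

/* Example window used for illustration: 07:00 to 23:00. */
const struct daily_window WINDOW_7_TO_23 = { 7 * 3600L, 23 * 3600L };

int window_contains(struct daily_window w, long t)
{
    long time_of_day = t % DAY_SECONDS;

    return time_of_day >= w.start && time_of_day < w.end;
}

/* Approach 2: earliest time >= t that lies inside the window. */
long next_time_in_window(struct daily_window w, long t)
{
    long time_of_day, midnight;

    assert(t >= 0 && w.start >= 0 && w.start < w.end && w.end <= DAY_SECONDS);
    time_of_day = t % DAY_SECONDS;
    midnight = t - time_of_day;

    if (time_of_day < w.start)
        return midnight + w.start;               /* later today */
    if (time_of_day < w.end)
        return t;                                /* already valid */
    return midnight + DAY_SECONDS + w.start;     /* tomorrow */
}

/* Approach 1: compensate for a backward shift, but never move a check
 * that was inside the window to a time outside of it. */
long compensate_in_window(struct daily_window w, long next_check, long shift)
{
    long adjusted;

    assert(shift >= 0 && next_check >= shift);
    adjusted = next_check - shift;

    if (window_contains(w, next_check) && !window_contains(w, adjusted))
        return next_time_in_window(w, adjusted);
    return adjusted;
}
```

With the 8 minute drift from the example, compensate_in_window() keeps a 07:00 check at 07:00 instead of moving it to 06:52, and next_time_in_window() maps a check that already landed at 06:52 back to 07:00 of the same day.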

Ok, so far from me - /discuss :-)

S

-- 
Sascha Runschke
IT-Infrastruktur

GFKL Financial Services AG
Limbecker Platz 1
45127 Essen

Phone: +49 (201) 102-1879  Mobile: +49 (173) 5419665  Fax: +49 (201) 102-1102105


