first_notification_delay - notification may be sent too early

Jochen Bern Jochen.Bern at LINworks.de
Wed Sep 22 19:25:28 CEST 2010


Sorry for the delay in replying, but I wasn't on the list back then and
the issue stepped into my crosshairs just today ...

On 05/20/2010 10:01 GMT, Andreas Ericsson wrote:
> On 05/20/2010 11:19 AM, Paweł Małachowski wrote:
>> according to manual:
>>> first_notification_delay: This directive is used to define the number of
>>> "time units" to wait before sending out the first problem notification when
>>> this host enters a non-UP state.
>> However, it may send notification earlier, because time is counted starting
>> from last UP state, not first non-UP state.
>> Code snippet for illustration of this behaviour:
>>          if(type==NOTIFICATION_NORMAL && hst->current_notification_number==0 &&
>>              hst->current_state!=HOST_UP && (current_time<(time_t)
>>                  ((hst->last_time_up==(time_t)0L)?program_start:hst->last_time_up
>>                      + (hst->first_notification_delay*interval_length)))){
>> Probably using "last_state_change" instead of "last_time_up" would be better
>> (haven't tried).
> 
> It used to be last_hard_state_change. I don't quite see why it was
> changed, apart from a dubious comment right on top of the code
> about not delaying recovery notifications, but that seems totally
> bogus, since it already checks that current state isn't HOST_UP.
> 
> The same goes for services, btw. You could try changing it to use
> last_hard_state_change instead of the current mess. If it works as
> advertised when you do, I'll make the adjustment to the nagios core
> so that the change goes in the next release.

How about we first pinpoint what *exactly* is *supposed* to happen? :-)

-- The very name "first_notification_delay" suggests that the timer
   should start when otherwise notification #1 would be sent; i.e.,
   when NewState == HARD non-OK && LastHardState == OK.

-- The description suggests that the timer already starts when the
   host/service changes from HARD OK to ***SOFT*** non-OK.

-- However, it could also be taken to mean that the first notification
   after entering a *specific* non-OK state (possibly from *another*
   non-OK state) should be affected as well.
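
To make the difference between these three readings concrete, here is a rough sketch of where the delay timer would start under each of them. Everything in it is hypothetical: check_event, delay_policy and delay_timer_start() are names made up for illustration and do not exist in Nagios, any OK result is assumed to clear the timer, and for the third reading any change of the state value (SOFT or HARD) is assumed to count as "entering" a state.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified check result; not a Nagios structure. */
typedef struct {
    time_t when;
    int state;    /* 0 = OK, anything else = one specific non-OK state */
    int is_hard;  /* 1 = HARD state, 0 = SOFT state */
} check_event;

typedef enum {
    START_AT_FIRST_HARD_PROBLEM,  /* reading 1: timer starts at first HARD non-OK */
    START_AT_FIRST_SOFT_PROBLEM,  /* reading 2: timer starts at first SOFT non-OK */
    START_AT_EACH_PROBLEM_STATE   /* reading 3: timer restarts per non-OK state   */
} delay_policy;

/* Returns the time the delay timer is counting from after the last event,
 * or 0 if no problem is active. A start time of 0 is used as "not running". */
time_t delay_timer_start(const check_event *ev, size_t n, delay_policy policy)
{
    time_t start = 0;
    int prev_state = 0;

    for (size_t i = 0; i < n; i++) {
        if (ev[i].state == 0) {
            start = 0;                      /* assumption: any OK result clears the timer */
        } else {
            int entering = 0;
            switch (policy) {
            case START_AT_FIRST_HARD_PROBLEM:
                entering = (start == 0 && ev[i].is_hard);
                break;
            case START_AT_FIRST_SOFT_PROBLEM:
                entering = (start == 0);
                break;
            case START_AT_EACH_PROBLEM_STATE:
                entering = (ev[i].state != prev_state);
                break;
            }
            if (entering)
                start = ev[i].when;
        }
        prev_state = ev[i].state;
    }
    return start;
}
```

For a sequence "SOFT CRITICAL at t=100, HARD CRITICAL at t=160, HARD WARNING at t=220", the three readings start the timer at 160, 100 and 220 respectively, which is exactly the ambiguity that needs settling before touching the code.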

As far as I can tell without taking a debugger to task, basing the logic
on last*state_change would imply that the delay gets retriggered on
changes between non-OK states, too.

There are fields hst->next_host_notification and svc->next_notification
which, as far as I've looked, are currently only used to implement
notification_interval and CMD_DELAY_HOST_NOTIFICATION; in particular,
checks.c sets only svc->next_notification, and only to (time_t)0 (if
state has changed or is OK), never to an "active" value. Wouldn't
setting these to time()+delay - e.g., in the non-OK branches of
process_host_check_result_3x() and handle_async_service_check_result() -
be a good approach to implement the from-OK-to-non-OK-only variant of
first_notification_delay?
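
A minimal sketch of that idea, untested against the real code base: svc_sketch, record_check_result() and first_notification_is_due() are hypothetical stand-ins, not the actual Nagios structures or functions, and SOFT/HARD handling is left out entirely. Note that, unlike what checks.c currently does, the sketch deliberately leaves next_notification alone on a change between two non-OK states, so the delay is not retriggered there.

```c
#include <time.h>

/* Hypothetical stand-in for the few service fields used here. */
typedef struct {
    int current_state;                /* 0 = OK, anything else = non-OK */
    double first_notification_delay;  /* in "time units" */
    time_t next_notification;         /* 0 = no delay pending */
} svc_sketch;

/* Record a new check result and arm the delay only on OK -> non-OK. */
void record_check_result(svc_sketch *svc, int new_state, time_t now,
                         int interval_length)
{
    int was_ok = (svc->current_state == 0);

    svc->current_state = new_state;

    if (new_state == 0) {
        /* Recovery: nothing to delay any more. */
        svc->next_notification = (time_t)0;
    } else if (was_ok) {
        /* First problem after OK: earliest time a notification may go out. */
        svc->next_notification =
            now + (time_t)(svc->first_notification_delay * interval_length);
    }
    /* non-OK -> other non-OK: keep the value set at the first problem. */
}

/* The notification check then reduces to a single comparison. */
int first_notification_is_due(const svc_sketch *svc, time_t now)
{
    return svc->current_state != 0 && now >= svc->next_notification;
}
```

With first_notification_delay=2 and interval_length=60, a service going non-OK at t=1000 would get next_notification=1120, and a later change to another non-OK state at t=1060 would not move that value.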

Kind regards,
								J. Bern

P.S.: I just noticed that the code has been changed in 3.2.2 as follows:

> if(type==NOTIFICATION_NORMAL && svc->current_notification_number==0 && svc->current_state!=STATE_OK){
>     /* determine the time to use of the first problem point */
>     first_problem_time=svc->last_time_ok; /* not accurate, but its the earliest time we could use in the comparison */
>     if((svc->last_time_warning < first_problem_time) && (svc->last_time_warning > svc->last_time_ok))
>         first_problem_time=svc->last_time_warning;
>     if((svc->last_time_unknown < first_problem_time) && (svc->last_time_unknown > svc->last_time_ok))
>         first_problem_time=svc->last_time_unknown;
>     if((svc->last_time_critical < first_problem_time) && (svc->last_time_critical > svc->last_time_ok))
>         first_problem_time=svc->last_time_critical;
>     if(current_time < (time_t)((first_problem_time==(time_t)0L)?program_start:
>         first_problem_time + (svc->first_notification_delay*interval_length))){

Umh ... is it just me, or are the conditions in the middle three "if"s
plain *impossible* to satisfy, once first_problem_time has been set to
the same value as svc->last_time_ok ... ?
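
To spell it out: each of those tests asks for a timestamp that is both smaller than first_problem_time and larger than svc->last_time_ok, while the two are equal. The sketch below reproduces the quoted selection logic on a hypothetical svc_times struct (not the real service struct) and puts next to it one possible alternative that picks the earliest last_time_* value later than last_time_ok. The alternative is only my guess at what was intended, and since these fields record the *last* time a state was seen rather than the first, it would still be an approximation.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for the timestamp fields of the service struct. */
typedef struct {
    time_t last_time_ok;
    time_t last_time_warning;
    time_t last_time_unknown;
    time_t last_time_critical;
} svc_times;

/* Selection logic as quoted from 3.2.2: since first_problem_time starts
 * out equal to last_time_ok, no value can be both below the former and
 * above the latter, so this always returns last_time_ok. */
time_t first_problem_time_quoted(const svc_times *s)
{
    time_t first_problem_time = s->last_time_ok;

    if ((s->last_time_warning < first_problem_time) && (s->last_time_warning > s->last_time_ok))
        first_problem_time = s->last_time_warning;
    if ((s->last_time_unknown < first_problem_time) && (s->last_time_unknown > s->last_time_ok))
        first_problem_time = s->last_time_unknown;
    if ((s->last_time_critical < first_problem_time) && (s->last_time_critical > s->last_time_ok))
        first_problem_time = s->last_time_critical;
    return first_problem_time;
}

/* Assumed intent: the earliest non-OK timestamp after last_time_ok,
 * falling back to last_time_ok if there is none. */
time_t first_problem_time_sketch(const svc_times *s)
{
    const time_t candidates[3] = {
        s->last_time_warning, s->last_time_unknown, s->last_time_critical
    };
    time_t best = 0;  /* 0 = no candidate found yet */

    for (size_t i = 0; i < 3; i++) {
        if (candidates[i] > s->last_time_ok && (best == 0 || candidates[i] < best))
            best = candidates[i];
    }
    return (best != 0) ? best : s->last_time_ok;
}
```

With last_time_ok=100, last_time_warning=150, last_time_unknown=0 and last_time_critical=200, the quoted logic yields 100 whereas the alternative yields 150.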
-- 
Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel

_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel

