[PATCH] notifications: Fix first_notification_delay

Robin Sonefors robin.sonefors at op5.com
Mon Dec 10 09:50:04 CET 2012


Great summary, thanks!

On 2012-12-09 05:04, eponymous alias wrote:
> Here is full background on this bug.
>
> These two sections of code in base/notifications.c are broken:
[snip]
> If you look at it closely, you'll see that most of the central if()'s
> are really just instances of:
>
>      if (B < A && B > A) ...
>
> which of course will never be satisfied, given the original value of
> first_problem_time.  And the second similar if() in the second section
> is really:
>
>      if (B < A && B > B) ...
>
> which is certainly non-functional.

So I noticed :)

> Bug history:
>
> A problem reported by Pawel Malachowski, May 20, 2010:
> http://comments.gmane.org/gmane.network.nagios.devel/7402
>
> Ethan Galstad's code change trying to address the reported issue,
> which introduced the bad code above, 2 Jun 2010:
> http://git.op5.org/git/?p=nagios.git;a=commitdiff_plain;h=7ff79f1352d738de97a905a4efc8204cf41db425
>
> Jochen Bern noticing the problem with the patch, mentioning it publicly,
> and proposing some in-depth thinking about what is really desired,
> September 22, 2010:
> http://permalink.gmane.org/gmane.network.nagios.devel/7521
> There was apparently no follow-up by anyone.

My two cents:

max_check_attempts and retry_interval already makes it very easy to set 
a delay for the first notification - in fact, I'd say they're way better 
than this mechanism, because they'll make sure a check is triggered when 
you want the notification, which first_notification_delay does not (and 
seemingly isn't supposed to).

Thus, as far as I can see, the value of having first_notification_delay 
is to set a delay that works regardless of state changes. Therefore, my 
patch implements point two, but none of the other, in Jochen's mail.

> Rogerio F Cunha noticing that the code wasn't working as desired, and
> proposing a patch, September 2, 2011.  Also, Andreas Ericsson thinking
> the patch was reasonable, but being busy that week, he apparently forgot
> about it.
> http://comments.gmane.org/gmane.network.nagios.devel/8176
>
> The code fragments above are the state of the code across all of the
> Nagios 3.2.3, 3.3.1, and 3.4.1 releases.  Perhaps it's time to take
> another look at Jochen Bern's proposal for thinking about the problem,
> and then seeing whether Rogerio F Cunha's patch actually moves the code
> in the desired direction.

Based on my reasoning above, I think the behavior he implements is less 
helpful than what I did.

However, I'd be more than happy to see my patch reverted and his 
applied, if it means that we stop basing the value on the last OK check, 
like we do currently :)

> ================================================================
> Related but different fix:
> http://blogs.op5.org/git/?p=nagios.git;a=commitdiff;h=ecfe7256ff90b1412986f276bf6cd6797d151203
>
> See also:
> http://tracker.nagios.org/view.php?id=297
> for the official bug report, and
> https://dev.icinga.org/issues/1145
> for a possible fix.

Those two are the same, and that's in nagios SVN already. It doesn't 
help the fact that we're still testing if ((A < B) && (A > B)) over and 
over.

> --- On Mon, 12/3/12, Robin Sonefors <robin.sonefors at op5.com> wrote:
>
>> From: Robin Sonefors <robin.sonefors at op5.com>
>> Subject: [Nagios-devel] [PATCH] notifications: Fix first_notification_delay
>> To: nagios-devel at lists.sourceforge.net
>> Cc: ae at op5.com
>> Date: Monday, December 3, 2012, 5:24 AM
>> Nagios bug #297, while being light on
>> details, points out a problem with
>> first_notification_delay that I think this patch clears up.
>>
>> The problem is two-fold: first, the logic for finding out
>> which state to
>> use is incorrect - a comment rightly points out that
>> last_time_{ok,up}
>> is the wrong value to use, and then proceeds to use any
>> value that is
>> both more recent and less recent than last_time_{ok,up} -
>> this condition
>> is likely to fail, meaning we always use the incorrect
>> value.
>>
>> Second, the logic is impossible to correct, though, because
>> all the
>> other values are incorrect, too - we always set
>> last_time_$state
>> whenever a check is executed, meaning the problem
>> last_time_$state
>> variable would usually be too recent.
>>
>> This is fixed by redefining last_time_$state. They used to
>> be "the last
>> time a check returned $state", now they're "the last time an
>> object was
>> in $state". The change is subtle, and the value is usually
>> updated the
>> way it was before, but when the state changes, we put the
>> end time of
>> the previous state in last_time_$previous_state, rather than
>> putting the
>> start time of the current state in last_time_$current_state.
>> Nagios core
>> does not use these variables anywhere else, and it's
>> unlikely to break
>> other peoples code, since last_state_change is a much better
>> way to get
>> the start time of the current state.
>>
>> With this change in place, all we need to do to find out
>> when a problem
>> began is to read last_time_{ok,up}, use that to compare
>> against
>> first_notification_delay, and we're done.
>>
>> Signed-off-by: Robin Sonefors <robin.sonefors at op5.com>
>> ---
>>   base/checks.c        |  8
>> ++++----
>>   base/notifications.c | 22 +++++-----------------
>>   2 files changed, 9 insertions(+), 21 deletions(-)
>>
>> diff --git a/base/checks.c b/base/checks.c
>> index 37aea54..af9e6ba 100644
>> --- a/base/checks.c
>> +++ b/base/checks.c
>> @@ -458,8 +458,8 @@ int
>> handle_async_service_check_result(service *temp_service,
>> check_result *queue
>>           }
>>
>>
>> -    /* record the last state time */
>> -    switch(temp_service->current_state)
>> {
>> +    /* record the time the last state ended
>> */
>> +    switch(temp_service->last_state) {
>>           case STATE_OK:
>>
>> temp_service->last_time_ok =
>> temp_service->last_check;
>>
>> break;
>> @@ -3236,8 +3236,8 @@ int handle_host_state(host *hst) {
>>       /* update performance data */
>>       update_host_performance_data(hst);
>>
>> -    /* record latest time for current state
>> */
>> -    switch(hst->current_state) {
>> +    /* record the time the last state ended
>> */
>> +    switch(hst->last_state) {
>>           case HOST_UP:
>>
>> hst->last_time_up = current_time;
>>
>> break;
>> diff --git a/base/notifications.c b/base/notifications.c
>> index 2dff2ef..226bc42 100644
>> --- a/base/notifications.c
>> +++ b/base/notifications.c
>> @@ -522,19 +522,12 @@ int
>> check_service_notification_viability(service *svc, int type,
>> int options) {
>>
>>       /* see if enough time has elapsed for
>> first notification (Mathias Sundman) */
>>       /* 10/02/07 don't place restrictions on
>> recoveries or non-normal notifications, must use last time
>> ok (or program start) in calculation */
>> -    /* it is reasonable to assume that if
>> the host was never up, the program start time should be used
>> in this calculation */
>> +    /* it is reasonable to assume that if
>> the service was never up, the program start time should be
>> used in this calculation */
>>       if(type == NOTIFICATION_NORMAL
>> && svc->current_notification_number == 0
>> && svc->current_state != STATE_OK) {
>>
>> -        /* determine the time
>> to use of the first problem point */
>> -        first_problem_time =
>> svc->last_time_ok; /* not accurate, but its the earliest
>> time we could use in the comparison */
>> -
>> if((svc->last_time_warning < first_problem_time)
>> && (svc->last_time_warning >
>> svc->last_time_ok))
>> -
>> first_problem_time = svc->last_time_warning;
>> -
>> if((svc->last_time_unknown < first_problem_time)
>> && (svc->last_time_unknown >
>> svc->last_time_ok))
>> -
>> first_problem_time = svc->last_time_unknown;
>> -
>> if((svc->last_time_critical < first_problem_time)
>> && (svc->last_time_critical >
>> svc->last_time_ok))
>> -
>> first_problem_time = svc->last_time_critical;
>> +        first_problem_time =
>> svc->last_time_ok > 0 ? svc->last_time_ok :
>> program_start;
>>
>> -        if(current_time <
>> (time_t)((first_problem_time == (time_t)0L) ? program_start
>> : first_problem_time) +
>> (time_t)(svc->first_notification_delay *
>> interval_length)) {
>> +        if(current_time <
>> first_problem_time +
>> (time_t)(svc->first_notification_delay *
>> interval_length)) {
>>
>> log_debug_info(DEBUGL_NOTIFICATIONS, 1, "Not enough time has
>> elapsed since the service changed to a non-OK state, so we
>> should not notify about this problem yet\n");
>>
>> return ERROR;
>>               }
>> @@ -1445,14 +1438,9 @@ int
>> check_host_notification_viability(host *hst, int type, int
>> options) {
>>       /* it is reasonable to assume that if
>> the host was never up, the program start time should be used
>> in this calculation */
>>       if(type == NOTIFICATION_NORMAL
>> && hst->current_notification_number == 0
>> && hst->current_state != HOST_UP) {
>>
>> -        /* determine the time
>> to use of the first problem point */
>> -        first_problem_time =
>> hst->last_time_up; /* not accurate, but its the earliest
>> time we could use in the comparison */
>> -
>> if((hst->last_time_down < first_problem_time)
>> && (hst->last_time_down >
>> hst->last_time_up))
>> -
>> first_problem_time = hst->last_time_down;
>> -
>> if((hst->last_time_unreachable < first_problem_time)
>> && (hst->last_time_unreachable >
>> hst->last_time_unreachable))
>> -
>> first_problem_time = hst->last_time_unreachable;
>> +        first_problem_time =
>> hst->last_time_up > 0 ? hst->last_time_up :
>> program_start;
>>
>> -        if(current_time <
>> (time_t)((first_problem_time == (time_t)0L) ? program_start
>> : first_problem_time) +
>> (time_t)(hst->first_notification_delay *
>> interval_length)) {
>> +        if(current_time <
>> first_problem_time +
>> (time_t)(hst->first_notification_delay *
>> interval_length)) {
>>
>> log_debug_info(DEBUGL_NOTIFICATIONS, 1, "Not enough time has
>> elapsed since the host changed to a non-UP state (or since
>> program start), so we shouldn't notify about this problem
>> yet.\n");
>>
>> return ERROR;
>>               }
>> --
>> 1.7.11.7
>>
>>
>> ------------------------------------------------------------------------------
>> Keep yourself connected to Go Parallel:
>> BUILD Helping you discover the best ways to construct your
>> parallel projects.
>> http://goparallel.sourceforge.net
>> _______________________________________________
>> Nagios-devel mailing list
>> Nagios-devel at lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>>
>
> ------------------------------------------------------------------------------
> LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
> Remotely access PCs and mobile devices and provide instant support
> Improve your efficiency, and focus on delivering more value-add services
> Discover what IT Professionals Know. Rescue delivers
> http://p.sf.net/sfu/logmein_12329d2d
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>


------------------------------------------------------------------------------
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d




More information about the Developers mailing list