[PATCH] notifications: Fix first_notification_delay

eponymous alias eponymousalias at yahoo.com
Sun Dec 9 05:04:29 CET 2012
Previous message: [PATCH] notifications: Fix first_notification_delay
Next message: [PATCH] notifications: Fix first_notification_delay
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Here is full background on this bug.

These two sections of code in base/notifications.c are broken:

--------
        /* see if enough time has elapsed for first notification (Mathias Sundman) */
        /* 10/02/07 don't place restrictions on recoveries or non-normal notifications, must use last time ok (or program start) in calculation */
        /* it is reasonable to assume that if the host was never up, the program start time should be used in this calculation */
        if(type == NOTIFICATION_NORMAL && svc->current_notification_number == 0 && svc->current_state != STATE_OK) {

                /* determine the time to use of the first problem point */
                first_problem_time = svc->last_time_ok; /* not accurate, but its the earliest time we could use in the comparison */
                if((svc->last_time_warning < first_problem_time) && (svc->last_time_warning > svc->last_time_ok))
                        first_problem_time = svc->last_time_warning;
                if((svc->last_time_unknown < first_problem_time) && (svc->last_time_unknown > svc->last_time_ok))
                        first_problem_time = svc->last_time_unknown;
                if((svc->last_time_critical < first_problem_time) && (svc->last_time_critical > svc->last_time_ok))
                        first_problem_time = svc->last_time_critical;

                if(current_time < (time_t)((first_problem_time == (time_t)0L) ? program_start : first_problem_time) + (time_t)(svc->first_notification_delay * interval_length)) {
                        log_debug_info(DEBUGL_NOTIFICATIONS, 1, "Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet\n");
                        return ERROR;
                        }
                }

--------
        /* see if enough time has elapsed for first notification (Mathias Sundman) */
        /* 10/02/07 don't place restrictions on recoveries or non-normal notifications, must use last time up (or program start) in calculation */
        /* it is reasonable to assume that if the host was never up, the program start time should be used in this calculation */
        if(type == NOTIFICATION_NORMAL && hst->current_notification_number == 0 && hst->current_state != HOST_UP) {

                /* determine the time to use of the first problem point */
                first_problem_time = hst->last_time_up; /* not accurate, but its the earliest time we could use in the comparison */
                if((hst->last_time_down < first_problem_time) && (hst->last_time_down > hst->last_time_up))
                        first_problem_time = hst->last_time_down;
                if((hst->last_time_unreachable < first_problem_time) && (hst->last_time_unreachable > hst->last_time_unreachable))
                        first_problem_time = hst->last_time_unreachable;

                if(current_time < (time_t)((first_problem_time == (time_t)0L) ? program_start : first_problem_time) + (time_t)(hst->first_notification_delay * interval_length)) {
                        log_debug_info(DEBUGL_NOTIFICATIONS, 1, "Not enough time has elapsed since the host changed to a non-UP state (or since program start), so we shouldn't notify about this problem yet.\n");
                        return ERROR;
                        }
                }
--------

If you look at it closely, you'll see that most of the central if()'s
are really just instances of:

    if (B < A && B > A) ...

which of course will never be satisfied, given the original value of
first_problem_time.  And the second similar if() in the second section
is really:

    if (B < A && B > B) ...

which is certainly non-functional.

Bug history:

A problem reported by Pawel Malachowski, May 20, 2010:
http://comments.gmane.org/gmane.network.nagios.devel/7402

Ethan Galstad's code change trying to address the reported issue,
which introduced the bad code above, 2 Jun 2010:
http://git.op5.org/git/?p=nagios.git;a=commitdiff_plain;h=7ff79f1352d738de97a905a4efc8204cf41db425

Jochen Bern noticing the problem with the patch, mentioning it publicly,
and proposing some in-depth thinking about what is really desired,
September 22, 2010:
http://permalink.gmane.org/gmane.network.nagios.devel/7521
There was apparently no follow-up by anyone.

Rogerio F Cunha noticing that the code wasn't working as desired, and
proposing a patch, September 2, 2011.  Also, Andreas Ericsson thinking
the patch was reasonable, but being busy that week, he apparently forgot
about it.
http://comments.gmane.org/gmane.network.nagios.devel/8176

The code fragments above are the state of the code across all of the
Nagios 3.2.3, 3.3.1, and 3.4.1 releases.  Perhaps it's time to take
another look at Jochen Bern's proposal for thinking about the problem,
and then seeing whether Rogerio F Cunha's patch actually moves the code
in the desired direction.

================================================================
Related but different fix:
http://blogs.op5.org/git/?p=nagios.git;a=commitdiff;h=ecfe7256ff90b1412986f276bf6cd6797d151203

See also:
http://tracker.nagios.org/view.php?id=297
for the official bug report, and
https://dev.icinga.org/issues/1145
for a possible fix.

--- On Mon, 12/3/12, Robin Sonefors <robin.sonefors at op5.com> wrote:

> From: Robin Sonefors <robin.sonefors at op5.com>
> Subject: [Nagios-devel] [PATCH] notifications: Fix first_notification_delay
> To: nagios-devel at lists.sourceforge.net
> Cc: ae at op5.com
> Date: Monday, December 3, 2012, 5:24 AM
> Nagios bug #297, while being light on
> details, points out a problem with
> first_notification_delay that I think this patch clears up.
> 
> The problem is two-fold: first, the logic for finding out
> which state to
> use is incorrect - a comment rightly points out that
> last_time_{ok,up}
> is the wrong value to use, and then proceeds to use any
> value that is
> both more recent and less recent than last_time_{ok,up} -
> this condition
> is likely to fail, meaning we always use the incorrect
> value.
> 
> Second, the logic is impossible to correct, though, because
> all the
> other values are incorrect, too - we always set
> last_time_$state
> whenever a check is executed, meaning the problem
> last_time_$state
> variable would usually be too recent.
> 
> This is fixed by redefining last_time_$state. They used to
> be "the last
> time a check returned $state", now they're "the last time an
> object was
> in $state". The change is subtle, and the value is usually
> updated the
> way it was before, but when the state changes, we put the
> end time of
> the previous state in last_time_$previous_state, rather than
> putting the
> start time of the current state in last_time_$current_state.
> Nagios core
> does not use these variables anywhere else, and it's
> unlikely to break
> other peoples code, since last_state_change is a much better
> way to get
> the start time of the current state.
> 
> With this change in place, all we need to do to find out
> when a problem
> began is to read last_time_{ok,up}, use that to compare
> against
> first_notification_delay, and we're done.
> 
> Signed-off-by: Robin Sonefors <robin.sonefors at op5.com>
> ---
>  base/checks.c        |  8
> ++++----
>  base/notifications.c | 22 +++++-----------------
>  2 files changed, 9 insertions(+), 21 deletions(-)
> 
> diff --git a/base/checks.c b/base/checks.c
> index 37aea54..af9e6ba 100644
> --- a/base/checks.c
> +++ b/base/checks.c
> @@ -458,8 +458,8 @@ int
> handle_async_service_check_result(service *temp_service,
> check_result *queue
>          }
>  
>  
> -    /* record the last state time */
> -    switch(temp_service->current_state)
> {
> +    /* record the time the last state ended
> */
> +    switch(temp_service->last_state) {
>          case STATE_OK:
>             
> temp_service->last_time_ok =
> temp_service->last_check;
>             
> break;
> @@ -3236,8 +3236,8 @@ int handle_host_state(host *hst) {
>      /* update performance data */
>      update_host_performance_data(hst);
>  
> -    /* record latest time for current state
> */
> -    switch(hst->current_state) {
> +    /* record the time the last state ended
> */
> +    switch(hst->last_state) {
>          case HOST_UP:
>             
> hst->last_time_up = current_time;
>             
> break;
> diff --git a/base/notifications.c b/base/notifications.c
> index 2dff2ef..226bc42 100644
> --- a/base/notifications.c
> +++ b/base/notifications.c
> @@ -522,19 +522,12 @@ int
> check_service_notification_viability(service *svc, int type,
> int options) {
>  
>      /* see if enough time has elapsed for
> first notification (Mathias Sundman) */
>      /* 10/02/07 don't place restrictions on
> recoveries or non-normal notifications, must use last time
> ok (or program start) in calculation */
> -    /* it is reasonable to assume that if
> the host was never up, the program start time should be used
> in this calculation */
> +    /* it is reasonable to assume that if
> the service was never up, the program start time should be
> used in this calculation */
>      if(type == NOTIFICATION_NORMAL
> && svc->current_notification_number == 0
> && svc->current_state != STATE_OK) {
>  
> -        /* determine the time
> to use of the first problem point */
> -        first_problem_time =
> svc->last_time_ok; /* not accurate, but its the earliest
> time we could use in the comparison */
> -       
> if((svc->last_time_warning < first_problem_time)
> && (svc->last_time_warning >
> svc->last_time_ok))
> -           
> first_problem_time = svc->last_time_warning;
> -       
> if((svc->last_time_unknown < first_problem_time)
> && (svc->last_time_unknown >
> svc->last_time_ok))
> -           
> first_problem_time = svc->last_time_unknown;
> -       
> if((svc->last_time_critical < first_problem_time)
> && (svc->last_time_critical >
> svc->last_time_ok))
> -           
> first_problem_time = svc->last_time_critical;
> +        first_problem_time =
> svc->last_time_ok > 0 ? svc->last_time_ok :
> program_start;
>  
> -        if(current_time <
> (time_t)((first_problem_time == (time_t)0L) ? program_start
> : first_problem_time) +
> (time_t)(svc->first_notification_delay *
> interval_length)) {
> +        if(current_time <
> first_problem_time +
> (time_t)(svc->first_notification_delay *
> interval_length)) {
>             
> log_debug_info(DEBUGL_NOTIFICATIONS, 1, "Not enough time has
> elapsed since the service changed to a non-OK state, so we
> should not notify about this problem yet\n");
>             
> return ERROR;
>              }
> @@ -1445,14 +1438,9 @@ int
> check_host_notification_viability(host *hst, int type, int
> options) {
>      /* it is reasonable to assume that if
> the host was never up, the program start time should be used
> in this calculation */
>      if(type == NOTIFICATION_NORMAL
> && hst->current_notification_number == 0
> && hst->current_state != HOST_UP) {
>  
> -        /* determine the time
> to use of the first problem point */
> -        first_problem_time =
> hst->last_time_up; /* not accurate, but its the earliest
> time we could use in the comparison */
> -       
> if((hst->last_time_down < first_problem_time)
> && (hst->last_time_down >
> hst->last_time_up))
> -           
> first_problem_time = hst->last_time_down;
> -       
> if((hst->last_time_unreachable < first_problem_time)
> && (hst->last_time_unreachable >
> hst->last_time_unreachable))
> -           
> first_problem_time = hst->last_time_unreachable;
> +        first_problem_time =
> hst->last_time_up > 0 ? hst->last_time_up :
> program_start;
>  
> -        if(current_time <
> (time_t)((first_problem_time == (time_t)0L) ? program_start
> : first_problem_time) +
> (time_t)(hst->first_notification_delay *
> interval_length)) {
> +        if(current_time <
> first_problem_time +
> (time_t)(hst->first_notification_delay *
> interval_length)) {
>             
> log_debug_info(DEBUGL_NOTIFICATIONS, 1, "Not enough time has
> elapsed since the host changed to a non-UP state (or since
> program start), so we shouldn't notify about this problem
> yet.\n");
>             
> return ERROR;
>              }
> -- 
> 1.7.11.7
> 
> 
> ------------------------------------------------------------------------------
> Keep yourself connected to Go Parallel: 
> BUILD Helping you discover the best ways to construct your
> parallel projects.
> http://goparallel.sourceforge.net
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
> 

------------------------------------------------------------------------------
LogMeIn Rescue: Anywhere, Anytime Remote support for IT. Free Trial
Remotely access PCs and mobile devices and provide instant support
Improve your efficiency, and focus on delivering more value-add services
Discover what IT Professionals Know. Rescue delivers
http://p.sf.net/sfu/logmein_12329d2d
Previous message: [PATCH] notifications: Fix first_notification_delay
Next message: [PATCH] notifications: Fix first_notification_delay
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Developers mailing list