[PATCH] notifications: Fix first_notification_delay

Robin Sonefors robin.sonefors at op5.com
Mon Dec 3 14:24:18 CET 2012
Next message: [PATCH] notifications: Fix first_notification_delay
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Nagios bug #297, while being light on details, points out a problem with
first_notification_delay that I think this patch clears up.

The problem is two-fold: first, the logic for finding out which state to
use is incorrect - a comment rightly points out that last_time_{ok,up}
is the wrong value to use, and then proceeds to use any value that is
both more recent and less recent than last_time_{ok,up} - this condition
is likely to fail, meaning we always use the incorrect value.

Second, the logic is impossible to correct, though, because all the
other values are incorrect, too - we always set last_time_$state
whenever a check is executed, meaning the problem last_time_$state
variable would usually be too recent.

This is fixed by redefining last_time_$state. They used to be "the last
time a check returned $state", now they're "the last time an object was
in $state". The change is subtle, and the value is usually updated the
way it was before, but when the state changes, we put the end time of
the previous state in last_time_$previous_state, rather than putting the
start time of the current state in last_time_$current_state. Nagios core
does not use these variables anywhere else, and it's unlikely to break
other peoples code, since last_state_change is a much better way to get
the start time of the current state.

With this change in place, all we need to do to find out when a problem
began is to read last_time_{ok,up}, use that to compare against
first_notification_delay, and we're done.

Signed-off-by: Robin Sonefors <robin.sonefors at op5.com>
---
 base/checks.c        |  8 ++++----
 base/notifications.c | 22 +++++-----------------
 2 files changed, 9 insertions(+), 21 deletions(-)

diff --git a/base/checks.c b/base/checks.c
index 37aea54..af9e6ba 100644
--- a/base/checks.c
+++ b/base/checks.c
@@ -458,8 +458,8 @@ int handle_async_service_check_result(service *temp_service, check_result *queue
 		}
 
 
-	/* record the last state time */
-	switch(temp_service->current_state) {
+	/* record the time the last state ended */
+	switch(temp_service->last_state) {
 		case STATE_OK:
 			temp_service->last_time_ok = temp_service->last_check;
 			break;
@@ -3236,8 +3236,8 @@ int handle_host_state(host *hst) {
 	/* update performance data */
 	update_host_performance_data(hst);
 
-	/* record latest time for current state */
-	switch(hst->current_state) {
+	/* record the time the last state ended */
+	switch(hst->last_state) {
 		case HOST_UP:
 			hst->last_time_up = current_time;
 			break;
diff --git a/base/notifications.c b/base/notifications.c
index 2dff2ef..226bc42 100644
--- a/base/notifications.c
+++ b/base/notifications.c
@@ -522,19 +522,12 @@ int check_service_notification_viability(service *svc, int type, int options) {
 
 	/* see if enough time has elapsed for first notification (Mathias Sundman) */
 	/* 10/02/07 don't place restrictions on recoveries or non-normal notifications, must use last time ok (or program start) in calculation */
-	/* it is reasonable to assume that if the host was never up, the program start time should be used in this calculation */
+	/* it is reasonable to assume that if the service was never up, the program start time should be used in this calculation */
 	if(type == NOTIFICATION_NORMAL && svc->current_notification_number == 0 && svc->current_state != STATE_OK) {
 
-		/* determine the time to use of the first problem point */
-		first_problem_time = svc->last_time_ok; /* not accurate, but its the earliest time we could use in the comparison */
-		if((svc->last_time_warning < first_problem_time) && (svc->last_time_warning > svc->last_time_ok))
-			first_problem_time = svc->last_time_warning;
-		if((svc->last_time_unknown < first_problem_time) && (svc->last_time_unknown > svc->last_time_ok))
-			first_problem_time = svc->last_time_unknown;
-		if((svc->last_time_critical < first_problem_time) && (svc->last_time_critical > svc->last_time_ok))
-			first_problem_time = svc->last_time_critical;
+		first_problem_time = svc->last_time_ok > 0 ? svc->last_time_ok : program_start;
 
-		if(current_time < (time_t)((first_problem_time == (time_t)0L) ? program_start : first_problem_time) + (time_t)(svc->first_notification_delay * interval_length)) {
+		if(current_time < first_problem_time + (time_t)(svc->first_notification_delay * interval_length)) {
 			log_debug_info(DEBUGL_NOTIFICATIONS, 1, "Not enough time has elapsed since the service changed to a non-OK state, so we should not notify about this problem yet\n");
 			return ERROR;
 			}
@@ -1445,14 +1438,9 @@ int check_host_notification_viability(host *hst, int type, int options) {
 	/* it is reasonable to assume that if the host was never up, the program start time should be used in this calculation */
 	if(type == NOTIFICATION_NORMAL && hst->current_notification_number == 0 && hst->current_state != HOST_UP) {
 
-		/* determine the time to use of the first problem point */
-		first_problem_time = hst->last_time_up; /* not accurate, but its the earliest time we could use in the comparison */
-		if((hst->last_time_down < first_problem_time) && (hst->last_time_down > hst->last_time_up))
-			first_problem_time = hst->last_time_down;
-		if((hst->last_time_unreachable < first_problem_time) && (hst->last_time_unreachable > hst->last_time_unreachable))
-			first_problem_time = hst->last_time_unreachable;
+		first_problem_time = hst->last_time_up > 0 ? hst->last_time_up : program_start;
 
-		if(current_time < (time_t)((first_problem_time == (time_t)0L) ? program_start : first_problem_time) + (time_t)(hst->first_notification_delay * interval_length)) {
+		if(current_time < first_problem_time + (time_t)(hst->first_notification_delay * interval_length)) {
 			log_debug_info(DEBUGL_NOTIFICATIONS, 1, "Not enough time has elapsed since the host changed to a non-UP state (or since program start), so we shouldn't notify about this problem yet.\n");
 			return ERROR;
 			}
-- 
1.7.11.7


------------------------------------------------------------------------------
Keep yourself connected to Go Parallel: 
BUILD Helping you discover the best ways to construct your parallel projects.
http://goparallel.sourceforge.net
Next message: [PATCH] notifications: Fix first_notification_delay
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
More information about the Developers mailing list