[PATCH] notifications: Fix first_notification_delay

Andreas Ericsson ae at op5.se
Mon Dec 10 16:44:17 CET 2012


On 12/10/2012 09:50 AM, Robin Sonefors wrote:
> Great summary, thanks!
> 
> On 2012-12-09 05:04, eponymous alias wrote:
>> Here is full background on this bug.
>>
>> These two sections of code in base/notifications.c are broken:
> [snip]
>> If you look at it closely, you'll see that most of the central if()'s
>> are really just instances of:
>>
>>      if (B < A && B > A) ...
>>
>> which of course will never be satisfied, given the original value of
>> first_problem_time.  And the second similar if() in the second section
>> is really:
>>
>>      if (B < A && B > B) ...
>>
>> which is certainly non-functional.
> 
> So I noticed :)
> 
>> Bug history:
>>
>> A problem reported by Pawel Malachowski, May 20, 2010:
>> http://comments.gmane.org/gmane.network.nagios.devel/7402
>>
>> Ethan Galstad's code change trying to address the reported issue,
>> which introduced the bad code above, 2 Jun 2010:
>> http://git.op5.org/git/?p=nagios.git;a=commitdiff_plain;h=7ff79f1352d738de97a905a4efc8204cf41db425
>>
>> Jochen Bern noticing the problem with the patch, mentioning it publicly,
>> and proposing some in-depth thinking about what is really desired,
>> September 22, 2010:
>> http://permalink.gmane.org/gmane.network.nagios.devel/7521
>> There was apparently no follow-up by anyone.
> 
> My two cents:
> 
> max_check_attempts and retry_interval already make it very easy to set a delay for the first notification. In fact, I'd say they're way better than this mechanism, because they make sure a check is triggered when you want the notification, which first_notification_delay does not (and seemingly isn't supposed to).
> 
> Thus, as far as I can see, the value of having first_notification_delay is to set a delay that works regardless of state changes. Therefore, my patch implements point two, but none of the others, in Jochen's mail.
> 

AFAIR, the original use-case was to allow operators to react to HARD
alerts and acknowledge or fix them before notifications were sent out.
As such, the logic that makes "first_notification_delay" only trigger
notifications after a new check makes perfect sense.

It's unfortunate that the original algorithm didn't schedule a check
to run at the exact time a notification was due and then send the
notification if that check came back non-OK. Ignoring checks between
the hard failure and the notification is not a viable solution either,
though, and adding support for multiple checks scheduled at the same
time would make this more complex than necessary. It would also go
against all the online documentation regarding this feature (such as
blog posts and the like).

Based on that reasoning, it seems the following rules would make the
most sense:
* first_notification_delay should delay notifications since the most
  recent HARD problem state but await the result of a check before it
  actually sends a notification.
* If the delay moves the notification into a period when
  notifications should be sent, it should be sent even if the alert
  itself occurred during a period when no notifications should have
  been sent.
* If the state changes during the delay to one which should not
  result in a notification, no notification should be sent out.
* Delaying a notification should not increase its notification_number,
  and will, as such, affect both regular and escalated notifications.
* Custom, downtime, acknowledgement and flapping notifications will
  never be delayed (flapping is arguable, but matches current code).
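
The rules above could be sketched in C roughly as follows. This is only
an illustration of one reading of them, not the code in
base/notifications.c: the struct, the enum, the function name and the
choice to pass the delay in seconds are all made up for the example, and
"await the result of a check" is read here as "a check result must have
arrived at or after the moment the delay expired".

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical notification types, not the constants Nagios defines. */
enum notif_type {
    NOTIF_NORMAL,
    NOTIF_CUSTOM,
    NOTIF_DOWNTIME,
    NOTIF_ACKNOWLEDGEMENT,
    NOTIF_FLAPPING
};

/* Hypothetical, trimmed-down view of a host or service. */
struct object_state {
    bool   is_hard_problem;   /* latest check result is a HARD non-OK state */
    time_t last_hard_problem; /* start of the most recent HARD problem state */
    time_t last_check;        /* arrival time of the latest check result */
    long   delay_secs;        /* first_notification_delay, assumed in seconds */
};

/*
 * Returns true if the delay rules allow a notification to go out now.
 * The function does not touch any notification counter, so a delayed
 * notification never increases notification_number.
 */
bool delay_permits_notification(const struct object_state *o,
                                enum notif_type type,
                                time_t now,
                                bool in_notification_period_now)
{
    /* Custom, downtime, acknowledgement and flapping notifications
     * are never delayed; other filters are not modelled here. */
    if (type != NOTIF_NORMAL)
        return true;

    /* The state changed to one that should not notify. */
    if (!o->is_hard_problem)
        return false;

    /* The delay counts from the most recent HARD problem state. */
    time_t earliest = o->last_hard_problem + o->delay_secs;
    if (now < earliest)
        return false;

    /* Await a check result that arrived once the delay had expired. */
    if (o->last_check < earliest)
        return false;

    /* Only the current time is tested against the notification
     * period, not the time at which the alert happened. */
    return in_notification_period_now;
}
```

With a HARD problem at t=1000 and a 600-second delay, a check result at
t=1700 inside the notification period lets the notification through,
while the same call at t=1300 does not.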

Comments on that? I'm busy writing documentation a while longer, so
feel free to chip in. I'll apply something on Wednesday if I haven't
heard any arguments for or against before that.

And yes, this is now officially the thinking session for what to do
with it, so we'll make a decision here and get rid of the wretched
issue once and for all.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
