[PATCH] notifications: Fix first_notification_delay

Andreas Ericsson ae at op5.se
Mon Dec 10 23:33:46 CET 2012


On 12/10/2012 06:27 PM, Jochen Bern wrote:
> On 10.12.2012 16:44, Andreas Ericsson wrote:
>> AFAIR, the original use-case [of first_notification_delay] was
>> to allow operators to react to HARD alerts and acknowledge or
>> fix them before notifications were sent out.
> 
> At least, that's how some organizations *did and do* use it. And some
> probably also add the on-turning-HARD event handler execution to the mix
> of things that hopefully might make notifications unnecessary in the
> last second.
> 
> FWIW, from a rather principles-oriented point of view, the sequence of
> SOFT non-OK --> HARD non-OK --> first notification (with the different
> degrees of visibility these states imply) is as much a part of the
> system of escalations as the part *called* "escalations" is. I wonder
> whether a long-term consolidation of terms and mechanisms might prove
> beneficial.
> 
>> * If delaying the notification causes it to end up in a time where
>>    notifications should be sent, it should be sent even if the time of
>>    the alert happened during a period when no notifications should have
>>    been sent.
>> * If delaying the check causes it to switch to a state which should
>>    not result in a notification, no notification should be sent out.
> 
> (That's how escalations *already* behave WRT earlier non- or
> lesser-escalated notifications, isn't it? Hence, The Right Thing To Do
> (tm) in my books.)
> 

AFAICS, yes.

>> * Delaying a notification should not increase its notification_number,
>>    and will, as such, affect both regular and escalated notifications.
> 
> *Most definitely* agreed! I know several organizations which would be
> confused to no end if I had to tell them that, under certain
> circumstances, there *just was no* notification #n preceding the #n+1
> they received and try to figure out.
> 
>> * Custom-, downtime-, acknowledgement and flapping notifications will
>>    never be delayed (flapping is arguable, but matches current code).
> 
> I am not aware, off the top of my head, of how Acknowledgment and
> Flapping notifications are supposed to behave WRT earlier notifications
> (as in "RECOVERYs are only sent to contacts who also had the PROBLEM
> sent to them"). If such a dependency does/should/will exist, whether or
> not to exempt them from first_notification_delay translates into
> potentially different sets of recipients.
> 

Recovery notifications are sent only to the contacts that supposedly got
the problem notification. That doesn't always work for escalations, though:
only the current tier of escalation gets the recovery notification.
One could argue about whether that's correct, but that's how it is today
at any rate.

> For acknowledgments, sending the notification early (and to the
> *restricted* set of recipients) is likely what the person acknowledging
> the problem *wants* to happen. FWIW, same thing for Downtimes, which are
> technically prophetic acknowledgments. ;-)
> 
> Customs can probably lean both ways, depending on what you use them for.
> 

There's also an extra option for custom notifications, which is to let the
notifier select how hard the notification should be forced, but I think
that's an unnecessary complication. Right now, custom notifications always
go to the primary contacts and the escalated ones aren't considered.

> Flapping ....... I'll have to pass on that. The things I monitor do not
> really flap, and flapping detection is typically disabled.
> 

Flapping is the only real issue, actually. It's a problem, of sorts, but
one where, going by empirical evidence, we might as well flip a coin to
decide whether to notify or not. I think the correct thing to do would be
to wait until the flapping ends and, if it comes out in a problem state,
add the time the node was flapping to problem_duration, which is measured
against the notify_delay to see if it's time to notify or not.
That means that a service that starts flapping at 15:05, stops flapping at
15:08 and goes into a non-flapping hard critical state should get the three
minutes it spent flapping discounted from the first_notification_delay.
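
To make the arithmetic concrete, here is a rough Python sketch of that
discount. It is not Nagios code: the function name and its parameters are
made up for illustration, and the 5-minute delay in the usage note is an
assumed value, not something taken from any real configuration.

```python
from datetime import datetime, timedelta


def first_notification_due(flap_start, flap_end, delay):
    """Earliest time the first notification may be sent (hypothetical helper).

    Assumes the object left flapping in a hard problem state at flap_end.
    The time spent flapping counts towards the problem duration, so it is
    discounted from the delay; the remaining wait never goes below zero.
    """
    flapping_time = flap_end - flap_start
    remaining = max(delay - flapping_time, timedelta(0))
    return flap_end + remaining


# The example from the text: flapping 15:05-15:08, assumed delay of 5 minutes.
due = first_notification_due(
    datetime(2012, 12, 10, 15, 5),
    datetime(2012, 12, 10, 15, 8),
    timedelta(minutes=5),
)
print(due.strftime("%H:%M"))
```

With the assumed 5-minute delay, the notification becomes due two minutes
after the flapping stops rather than five. A delay shorter than the flapping
time would make it due as soon as the flapping ends.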

In terms of code, that means we should invent a new "problem" data type
where we can stash the history of a particular problem, and which contacts
have been notified about it. Sounds like a 4.1 thing, really.
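
As a very loose sketch of what such a "problem" record could hold, the
Python below stashes a history of events and the set of contacts notified so
far. Every name in it (Problem, record, mark_notified, recovery_recipients)
is hypothetical and nothing here reflects an existing Nagios structure; it
only illustrates the idea of keeping per-problem state.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Problem:
    """Hypothetical per-problem record: event history plus notified contacts."""

    started: datetime
    history: list = field(default_factory=list)        # (timestamp, event) tuples
    notified_contacts: set = field(default_factory=set)

    def record(self, when, event):
        # Append one event (state change, flap start/stop, ...) to the history.
        self.history.append((when, event))

    def mark_notified(self, contacts):
        # Remember everyone who has been told about this problem so far.
        self.notified_contacts.update(contacts)

    def recovery_recipients(self):
        # Everyone ever notified about this problem, in a stable order.
        return sorted(self.notified_contacts)


problem = Problem(started=datetime(2012, 12, 10, 15, 8))
problem.record(datetime(2012, 12, 10, 15, 8), "HARD CRITICAL")
problem.mark_notified(["primary-admin"])
problem.mark_notified(["escalation-tier-2", "primary-admin"])
print(problem.recovery_recipients())
```

Keeping the notified set on the problem itself would also make it possible
to send a recovery to every contact that heard about the problem, not just
the current escalation tier, if that is the behaviour people want.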

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
