RFC: Downtime and flapping

Andreas Ericsson ae at op5.se
Fri Feb 4 11:57:09 CET 2011


On 02/04/2011 11:30 AM, Jochen Bern wrote:
> On 02/03/2011 11:59 PM, Andreas Ericsson wrote:
>> On 02/03/2011 07:53 PM, Ton Voon wrote:
>>> From the code, I can see that Nagios does not record any soft
>>> non-OK states in this state history. Any objections if I add "host
>>> or service in downtime" to that exception?
>> None at all. In fact, +1 on doing so. This way, downtime makes all
>> effects of state changes void and null
> 
> Umh, not quite, I'm afraid. It means that hosts/services will emerge
> from downtime with the history they had when they entered downtime
> way-back-when - which may well be the non-OK or FLAPPING which prompted
> you to schedule urgent repairs in the first place.
> 

True, but urgent repairs often cause flapping.
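
To make the proposal concrete, here is a rough sketch of a state history
that skips soft non-OK states and, per Ton's suggestion, anything seen
while the object is in downtime. This is not the Nagios code: the class
name, the method names, the 21-slot history length and the unweighted
percent-change formula are all assumptions made for illustration.

```python
from collections import deque

HISTORY_SLOTS = 21  # assumed history length for this sketch


class StateHistory:
    """Hypothetical flap-detection history; not the Nagios implementation."""

    def __init__(self):
        # Oldest entries fall off automatically once the deque is full.
        self.states = deque(maxlen=HISTORY_SLOTS)

    def record(self, state, is_soft, in_downtime):
        """Store a check result unless it falls under an exception.

        Returns True if the state was recorded, False if it was skipped.
        """
        if in_downtime:
            # Proposed exception: history is frozen during downtime.
            return False
        if is_soft and state != "OK":
            # Existing exception described in the thread.
            return False
        self.states.append(state)
        return True

    def percent_change(self):
        """Unweighted share of adjacent entries that differ (a simplification)."""
        states = list(self.states)
        if len(states) < 2:
            return 0.0
        changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
        return 100.0 * changes / (len(states) - 1)
```

With this sketch, an object leaves downtime with exactly the history it
had when it entered, which is the side effect Jochen points out.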

> IIUC, it also means that during the downtime, the CGI-bins will keep
> displaying the *historic* flapping state, along with the *current*
> host/service state.
> 

Perhaps, but it should clear up fairly rapidly, and if a FLAPPING_START
notification was sent out, I'd expect to get a FLAPPING_STOP one when
repairs are done, assuming that happens after downtime has ended.

If flapping starts during downtime, no flapping start notifications
will be sent out, so no flapping stop notifications will go out either.
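
The pairing I have in mind can be sketched roughly as follows. Again,
this is only an illustration and not the actual notification code: the
class, its attributes and the choice to drop a pending stop notification
when flapping ends during downtime are assumptions.

```python
class FlapNotifier:
    """Hypothetical start/stop pairing of flapping notifications."""

    def __init__(self):
        self.is_flapping = False
        self.start_notified = False  # was a start notification actually sent?
        self.sent = []               # notifications sent, in order

    def update(self, flapping_now, in_downtime):
        if flapping_now and not self.is_flapping:
            # Flapping begins; downtime suppresses the notification.
            self.is_flapping = True
            if not in_downtime:
                self.sent.append("FLAPPING_START")
                self.start_notified = True
        elif not flapping_now and self.is_flapping:
            # Flapping ends; only notify if a start went out earlier
            # and we are no longer in downtime.
            self.is_flapping = False
            if self.start_notified and not in_downtime:
                self.sent.append("FLAPPING_STOP")
            # Assumption: a stop that falls inside downtime is dropped.
            self.start_notified = False
```

Flapping that both starts and stops inside downtime produces no
notifications at all, while flapping that started before downtime still
gets its stop notification once repairs are done and downtime is over.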

> Downtime disables notifications anyway, and there already is logic to
> trigger actions when downtime ends (*). IMHO, the proper way to provide
> a clean slate after a downtime would be to flush (**) the entire history
> at that point.
> 

Effectively lying about state history? No thanks.

> (*) Notification type "s" - BTW,
> http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#contact
> lists services-"s" in the Definition Format but not in the Directive
> Descriptions.
> 
> (**) Whether the bins should be reset to OK, PENDING,
> last-before-downtime or the current post-downtime $*STATE$ (if one is
> already available) is up for discussion ...
> 

Current state will always be current state. I'm not going to change
that, ever. Most of our customers regularly check the UI during repairs
to see if the service is up and running as expected. Showing anything
but the *real* current state there would be counterproductive for all
Nagios users.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
