RFC: Downtime and flapping

Ton Voon tonvoon at gmail.com
Sun Feb 6 21:57:57 CET 2011


On 4 Feb 2011, at 10:30, Jochen Bern wrote:

> On 02/03/2011 11:59 PM, Andreas Ericsson wrote:
>> On 02/03/2011 07:53 PM, Ton Voon wrote:
>>> From the code, I can see that Nagios does not record any soft
>>> non-OK states in this state history. Any objections if I add "host
>>> or service in downtime" to that exception?
>> None at all. In fact, +1 on doing so. This way, downtime makes all
>> effects of statechanges void and null
> 
> Umh, not quite, I'm afraid. It means that hosts/services will emerge
> from downtime with the history they had when they entered downtime
> way-back-when - which may well be the non-OK or FLAPPING which prompted
> you to schedule urgent repairs in the first place.
> 
> It IIUC also means that during the downtime, the CGI-bins will keep
> displaying the *historic* flapping state, along with the *current*
> host/service state.
> 
> Downtime disables notifications anyway, and there already is logic to
> trigger actions when downtime ends (*). IMHO, the proper way to provide
> a clean slate after a downtime would be to flush (**) the entire history
> at that point.
> 
> (*) Notification type "s" - BTW,
> http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html#contact
> lists services-"s" in the Definition Format but not in the Directive
> Descriptions.
> 
> (**) Whether the bins should be reset to OK, PENDING,
> last-before-downtime or the current post-downtime $*STATE$ (if one is
> already available) is up for discussion ...

I think your main objection is that the flapping calculation could be based on "very old states" and thus "inaccurate" and "unintuitive". I'm happy with making a more radical change if it makes sense.

Stepping back, the purpose of flap detection is to disable notifications temporarily, but since scheduling downtime already disables notifications, does it make any sense to have flapping during downtimes?

So if we agree that overlapping downtime and flapping for the same object make no sense, I propose:
  * if an object is flapping at the time a downtime starts, a flapping stop is sent (this would need documenting: an object can get a flapping stop because a downtime is starting. If a user has downtime notifications enabled, they'll get two notifications in this case)
  * when an object goes into downtime, the state history is erased (I'm assuming the state history is only used for flap detection) and new states coming in during this downtime are not recorded. When the object comes out of downtime, state history starts again

During a downtime, the flapping percent will always be 0, and then it's an education/documentation issue that flap detection does not take effect in this period.
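
To make the proposed behaviour concrete, here is a minimal sketch in Python. It is not the Nagios C code: the class, its fields, the history length and the unweighted percentage are all assumptions made for illustration, and the notification strings are only labels.

```python
from dataclasses import dataclass, field
from typing import List

HISTORY_SLOTS = 21  # assumed history length, for illustration only


@dataclass
class MonitoredObject:
    """Hypothetical stand-in for a host or service."""
    name: str
    in_downtime: bool = False
    is_flapping: bool = False
    state_history: List[int] = field(default_factory=list)
    notifications: List[str] = field(default_factory=list)

    def record_state(self, state: int) -> None:
        # Proposal: states arriving during downtime are not recorded.
        if self.in_downtime:
            return
        self.state_history.append(state)
        # Keep only the newest HISTORY_SLOTS entries.
        del self.state_history[:-HISTORY_SLOTS]

    def flapping_percent(self) -> float:
        # Simplified, unweighted percentage of state changes.
        history = self.state_history
        if len(history) < 2:
            return 0.0
        changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
        return 100.0 * changes / (len(history) - 1)

    def start_downtime(self) -> None:
        # Proposal: a flapping object gets a flapping stop first, so a
        # contact with downtime notifications sees two notifications.
        if self.is_flapping:
            self.is_flapping = False
            self.notifications.append("FLAPPINGSTOP")
        self.notifications.append("DOWNTIMESTART")
        # Proposal: erase the state history when downtime begins.
        self.state_history.clear()
        self.in_downtime = True

    def end_downtime(self) -> None:
        # State history starts again from empty after downtime.
        self.in_downtime = False
        self.notifications.append("DOWNTIMEEND")
```

With this sketch, an object that was flapping before `start_downtime()` emits a flapping stop followed by a downtime start, reports a flapping percent of 0 for the whole downtime, and begins collecting history again only after `end_downtime()`.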

Would that be better?

Ton








