RFC: Downtime and flapping

Jochen Bern Jochen.Bern at LINworks.de
Mon Feb 7 12:25:17 CET 2011


On 02/06/2011 09:57 PM, Ton Voon wrote:
> Stepping back, the purpose of flap detection is to disable notifications
> temporarily, but since scheduling downtime already disables notifications,
> does it make any sense to have flapping during downtimes?

Personally, I agree with you here. However, I *could* imagine people who
take flapping as a concept on the level of operational state (like check
results), rather than of the alerts (notifications) derived from it, and
who might want to have it kept visible and updated during downtimes, very
much like we want the UI's "Current Status" field kept up to date.

Any such person on this list, by chance?

(Philosophical excursion: What you *really* want is to have your
statistics reset when you transition from "in downtime and manhandling
stuff" to "still officially in downtime, but already verifying success
of the intervention". That distinction, however, is nonexistent in Nagios.)

> So if we agree that downtime and flapping for the same object makes
> no sense when overlapping, I propose:
>   * if an object is in a flapping start state at the time of a downtime
>     start, a flapping stop is sent (this would need documenting that an
>     object can go to flapping stop due to downtime starting. If a user
>     has downtime notifications, they'll get two notifications in this
>     case)
>   * when an object goes into downtime, the state history is erased (I'm
>     assuming the state history is only used for flap detection) and new
>     states coming in during this downtime are not recorded. When the
>     object comes out of downtime, state history starts again
> During a downtime, the flapping percent will always be 0 and then it's an
> education/documentation issue that flap detection does not take effect
> in this period.
> Would that be better?

Your proposal differs from mine in that you flush the history upon
*start* of downtime (and keep it flushed all the way through), rather
than at the *end* of downtime. Both work for me, and I'd even call
your version the cleaner concept. The above hypothetical person,
however, would disagree violently. :-}
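
To make the difference between the two variants concrete, here is a
minimal Python sketch. It is not Nagios code: the FlapTracker class, its
flush_on parameter, the 21-slot history size and the unweighted
percentage are all assumptions made purely for illustration.

```python
from collections import deque

HISTORY_SLOTS = 21  # assumption: size of the state history buffer


class FlapTracker:
    """Hypothetical, simplified flap tracker; not the Nagios implementation."""

    def __init__(self, flush_on="start"):
        # flush_on="start": erase the history when downtime begins and
        #                   ignore new states until it ends.
        # flush_on="end":   keep recording during downtime and erase the
        #                   history once, when downtime ends.
        if flush_on not in ("start", "end"):
            raise ValueError("flush_on must be 'start' or 'end'")
        self.flush_on = flush_on
        self.history = deque(maxlen=HISTORY_SLOTS)
        self.in_downtime = False

    def start_downtime(self):
        self.in_downtime = True
        if self.flush_on == "start":
            self.history.clear()

    def end_downtime(self):
        self.in_downtime = False
        if self.flush_on == "end":
            self.history.clear()

    def record(self, state):
        # Under the flush-at-start policy, states seen in downtime are dropped.
        if self.in_downtime and self.flush_on == "start":
            return
        self.history.append(state)

    def percent_change(self):
        # Unweighted share of adjacent state pairs that differ.
        states = list(self.history)
        if len(states) < 2:
            return 0.0
        changes = sum(1 for a, b in zip(states, states[1:]) if a != b)
        return 100.0 * changes / (len(states) - 1)


def simulate(flush_on):
    """Return the flapping percentage before, during and after a downtime."""
    tracker = FlapTracker(flush_on)
    for state in ["OK", "CRITICAL", "OK", "CRITICAL"]:
        tracker.record(state)
    before = tracker.percent_change()
    tracker.start_downtime()
    for state in ["CRITICAL", "OK", "CRITICAL"]:
        tracker.record(state)
    during = tracker.percent_change()
    tracker.end_downtime()
    after = tracker.percent_change()
    return before, during, after
```

With flush_on="start" the percentage reads 0 for the whole downtime; with
flush_on="end" it keeps moving during the downtime and only drops to 0
when the downtime ends. Both variants start from an empty history once
the downtime is over, which is the point the two proposals share.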

Regards,
								J. Bern
-- 
Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel
