RFC: Downtime and flapping

Jochen Bern Jochen.Bern at LINworks.de
Fri Feb 4 12:59:00 CET 2011


On 02/04/2011 11:57 AM, Andreas Ericsson wrote:
> On 02/04/2011 11:30 AM, Jochen Bern wrote:
>> It IIUC also means that during the downtime, the CGI-bins will keep
>> displaying the *historic* flapping state, along with the *current*
>> host/service state.
> Perhaps, but it should clear up fairly rapidly, and if a FLAPPING_START
> notification was sent out, I'd expect to get a FLAPPING_STOP one when
> repairs are done, assuming that happens after downtime has ended.

Right now there is no guarantee that you will see a post-downtime
FLAPPING_STOP: those notifications are not exempt from being blocked by
the downtime, and to avoid the time gap completely you would have to
skip any testing of the UP-again service and manually delete the
remaining downtime right away. The same effect occurs when a
notification_period is used. I *did* go hunting for the "bug" when
colleagues reported that, on following up on a service they had last
received a FLAPPING_START for, they found a non-flapping OK in the UI.

>> Downtime disables notifications anyway, and there already is logic to
>> trigger actions when downtime ends (*). IMHO, the proper way to provide
>> a clean slate after a downtime would be to flush (**) the entire history
>> at that point.
> Effectively lying about state history? No thanks.

You talk about lying, I talk about misleading. Deriving a flapping flag
from a state history whose entries hail from way in the past - no matter
whether updates were blocked by downtime, check_interval, dependencies,
a forced reschedule into the distant future, or whatever - qualifies for
the latter.
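
To illustrate, here is a minimal sketch in Python (not the Nagios C code).
It assumes a 21-entry history and a single, hypothetical 50% threshold,
and it leaves out the weighting of newer changes and the low/high
hysteresis that the real flap detection applies. The names
percent_state_change and is_flapping are made up for this example. The
flag depends only on the stored bins, so if nothing updates them, it
keeps reporting whatever was true when they were last written:

```python
from typing import Sequence

HISTORY_SIZE = 21  # assumed number of history bins


def percent_state_change(history: Sequence[str]) -> float:
    """Unweighted share of adjacent history entries that differ, in percent."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
    return 100.0 * changes / (len(history) - 1)


def is_flapping(history: Sequence[str], high_threshold: float = 50.0) -> bool:
    """Hypothetical single-threshold check; no hysteresis, no weighting."""
    return percent_state_change(history) >= high_threshold


# Bins filled before the downtime started and never updated since.
stale_history = ["OK", "CRITICAL"] * 10 + ["OK"]
# State as seen by a check after the repair.
current_state = "OK"

# The flag is computed from the stale bins alone; current_state plays no part.
flag_shown_in_ui = is_flapping(stale_history)
```

With these inputs the UI would pair the current state with a flapping
flag derived entirely from pre-downtime entries.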

>> (**) Whether the bins should be reset to OK, PENDING,
>> last-before-downtime or the current post-downtime $*STATE$ (if one is
>> already available) is up for discussion ...
> Current state will always be current state. I'm not going to change
> that, ever. Most of our customers regularly check the ui during repairs
> to see if the service is up and running as expected. Showing anything
> but the *real* current state there would be counterproductive for all
> nagios users.

You might want to note that I never asked for overwriting "the" current
state (as opposed to the history bins) in the first place. As a matter
of fact, the UI's flapping marker being happily derived from a *stale*
history is sort of my main point.
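
A rough sketch of what I mean by flushing, again in Python and with
made-up names (ServiceRecord, flush_history_on_downtime_end); none of
this is existing Nagios code. Which value the bins get reset to is
exactly the open question from (**), so it is a parameter here, and the
default of using the current state is only one of the candidates:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceRecord:
    """Hypothetical stand-in for a service's state data."""
    current_state: str
    history: List[str] = field(default_factory=list)
    is_flapping: bool = False


def flush_history_on_downtime_end(svc: ServiceRecord,
                                  fill_state: Optional[str] = None) -> None:
    """Reset only the history bins and the flag derived from them.

    current_state is read, never written: the UI keeps showing the real
    current state.
    """
    fill = svc.current_state if fill_state is None else fill_state
    svc.history = [fill] * len(svc.history)
    svc.is_flapping = False


svc = ServiceRecord(current_state="OK",
                    history=["OK", "CRITICAL"] * 10 + ["OK"],
                    is_flapping=True)
flush_history_on_downtime_end(svc)
```

After the call, the current state is untouched and the bins no longer
carry pre-downtime entries.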

Regards,
								J. Bern
-- 
Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel
