Filtering out false alarms in an unreliable network

Andreas Ericsson ae at op5.se
Wed Oct 1 21:32:21 CEST 2008


Tuomas Toropainen wrote:
> The problem: how to filter out false alarms caused by short-time breaks in
> an unreliable network.
> 
> Think about a simple monitoring scenario in which you only want to ping
> various devices to see if they are up or not. So you have 200 hosts with
> only one service (PING) for each.
> 
> For one reason or another, short-time breaks occur in the network. That is,
> a particular host does not reply to PINGs for e.g. 30 seconds. These
> breaks should not cause a notification to be sent.
> 
> As for services, the filtering is easy with max_check_attempts and
> retry_check_interval. But the host check becomes a problem: after the first
> PING failure (soft state) the host is checked, and there is no
> retry_check_interval for hosts. So the host is declared to be down
> (almost) immediately.
> 
> The notifications about hosts can be delayed using
> first_notification_delay. This seems to work fine except for one thing:
> flap detection. Even if the notification is not sent, the host (and
> service) is logged to have changed state, and when enough such state
> changes occur, the host (and service) is placed in flapping state.
> 
> I do not want to disable flapping detection (or flapping notifications)
> completely, because they might be useful in many cases. What I would like
> to achieve is not to count those short-time outages when computing
> flapping percent state changes. How can I accomplish that?
> 

You can't. Flapping detection was designed to detect changes over a short
time, so making it *not* do that would be the same as disabling it.

> Should I go ahead and disable host checks completely? If only there were a
> retry_check_interval for hosts, it would solve all these problems.
> 

Upgrade to Nagios 3. Then you can specify retry_interval for hosts.
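
As a rough sketch of what that could look like in a Nagios 3 host
definition: the template name, host name, address and the specific numbers
below are made up for illustration, and check_interval / retry_interval are
counted in units of interval_length (60 seconds by default).

```
define host{
        use                     generic-host    ; hypothetical template supplying the other required directives
        host_name               router-01       ; hypothetical host
        address                 192.0.2.10      ; documentation-range address, replace with a real one
        check_interval          5               ; regular check every 5 time units while UP
        retry_interval          1               ; recheck every 1 time unit while in a soft DOWN state
        max_check_attempts      3               ; third consecutive failure makes the state hard
        }
```

With values like these, a host that misses one check is rechecked about a
minute later and only reaches a hard DOWN state if all three attempts fail,
so a 30-second break should recover while the state is still soft.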

> I think it is quite common that the short-time outages are
> network-related, i.e. the complete host is unresponsive instead of a
> single service. When this is taken into account, it seems weird that there
> is retry_check_interval for services but not for hosts. Or would it ruin
> the scheduling logic?
> 

AFAIU, if retry_interval * max_check_attempts is longer for the host than
for the services on that host, the services will reach a hard state first
and send their own notifications, even if the host later turns out to
actually be down instead of just glitching. You'll want to look out for
that. Other than that, there's no real problem.
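
To make the timing point concrete, here is a sketch of a PING service whose
retry window is a little longer than that of a host using retry_interval 1
and max_check_attempts 3. The template name, host name and thresholds are
assumptions for illustration, and check_ping is assumed to be defined as in
the sample configuration shipped with Nagios.

```
define service{
        use                     generic-service ; hypothetical template supplying the other required directives
        host_name               router-01       ; hypothetical host, assumed to use retry_interval 1 and max_check_attempts 3
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        check_interval          5
        retry_interval          1               ; same retry spacing as the host
        max_check_attempts      4               ; one more attempt than the host, so the host goes hard first
        }
```

Going by the reasoning above, the host would need roughly two time units of
retries to reach a hard state and the service roughly three, so the host
should be confirmed down before the service gets to notify.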

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
