Detecting partial outages

Andreas Ericsson ae at op5.se
Mon Aug 27 14:14:31 CEST 2007

David Barrett wrote:
> Is there any way to configure Nagios to detect and ignore partial outages?
> 
> Specifically, I have multiple datacenters for my production service, and
> then two separate locations from which I do monitoring.  It's very rare that
> any of the production datacenters goes down, but it does happen on occasion
> where one of the datacenters becomes inaccessible from only *one* of the
> monitoring stations.
> 
> (In other words, the datacenter is up and running fine, and appears
> accessible by real users, but looks down to one of my monitoring stations.)
> 
> Is there any way to configure Nagios to detect this sort of "partial outage"
> condition and ignore it?  I only want to be notified if it's reported down
> by *both* monitoring stations.
> 

If each production datacenter hosts its own Nagios server, there's no way
you can accomplish this, so I'll assume your two Nagios servers can still
communicate with each other even when one of the datacenters is down.

The best solution would be a NEB module that exchanges check results
between the two Nagios servers. When a notification is about to be sent,
have the same NEB module look up the status on the other Nagios server
and block the notification if either one reports the host as up. This way
you'll get a small additional delay before receiving a notification, but
since the module can force a check on either server whenever the other
one reports a failure, the delay should be minimal.
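
To make the blocking rule concrete, here is a rough sketch of the decision
only. It is written in Python for brevity and is not a NEB module (a real
one would be C against the Nagios headers). The names should_notify,
fetch_peer_state and force_peer_check are hypothetical, and the state
strings are simplified assumptions.

```python
from typing import Callable

UP, DOWN = "UP", "DOWN"  # simplified states; an assumption for this sketch


def should_notify(local_state: str,
                  fetch_peer_state: Callable[[], str],
                  force_peer_check: Callable[[], str]) -> bool:
    """Return True only when neither server reports the host as up."""
    if local_state == UP:
        return False  # the local server sees the host, nothing to send
    peer_state = fetch_peer_state()
    if peer_state == UP:
        # The peer's last result may be stale, so force a fresh check
        # on the peer before deciding to block the notification.
        peer_state = force_peer_check()
    return peer_state != UP
```

The forced check is what keeps the extra delay short: the module does not
have to wait for the peer's next scheduled check.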

Hacking up such a module should take about a week, assuming whoever does
the work is well-versed in C and has a decent grasp of nagios' internals.

A second option is to let an event handler report the check results to the
other server and add them to a list of some sort (database, flat file,
whatever), and then modify your notification script to actually send
notifications only when both servers report something as down.

Assuming you use "notification_interval 0" for all your hosts and
services, only the server that does the last check of a given
host/service will send a notification. This shouldn't take much more
than a day to hack up, but is less elegant. With a shared
network-capable database it shouldn't be too much trouble though.
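
A minimal sketch of that second option, under some assumptions: sqlite3
stands in for the shared network-capable database, and the table layout,
the function names and the server names in the usage note are all made up
for illustration. The event handler on each server would call
record_result, and the notification script would call
all_servers_report_down before sending anything.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (and create if needed) the shared result store."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " server TEXT, object TEXT, state TEXT,"
        " PRIMARY KEY (server, object))"
    )
    return db


def record_result(db, server, obj, state):
    """Called from the event handler: keep the latest state per server."""
    db.execute(
        "INSERT OR REPLACE INTO results VALUES (?, ?, ?)",
        (server, obj, state),
    )
    db.commit()


def all_servers_report_down(db, obj, servers):
    """Called from the notification script before sending anything."""
    rows = dict(
        db.execute(
            "SELECT server, state FROM results WHERE object = ?", (obj,)
        )
    )
    # A server with no recorded result does not count as confirming the outage.
    return all(rows.get(s) not in (None, "UP", "OK") for s in servers)
```

With two servers called "mon1" and "mon2", the notification script would
stay quiet until both have recorded a non-up state for the same host or
service.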

There are more options, but those are the two elegant ones I can think of.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
