managing latency-induced host down alerts

Michael W. Lucas mwlucas at blackhelicopters.org
Wed Sep 12 16:46:21 CEST 2007


Hi,

I'm using Nagios 2.9 on FreeBSD, on a wide area network that has
remote networks scattered across the USA and Mexico.

We have a problem where latency on some remote circuits rises due to
congestion.  This means that various service checks time out, as they
take more than 10 seconds to complete.  (Yes, this is a real problem,
and we're addressing it.  I'm using smokeping to track latency at
these sites now, analyzing traffic, etc.)

When latency spikes and a service check times out, Nagios checks the
host to see if it's alive.  Latency is still too high, so the host
check times out as well.  Host checks "crawl" up the chain to the
parent router for the site, and flag it as down.  The end result is
that Nagios sees brief two-minute outages at the remote site.
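
To make the cascade concrete, here is a rough Python sketch of it.  It
is a simplified model for illustration only, not how Nagios actually
schedules or evaluates checks.  The host names, the PARENTS map, and
the helper functions are all hypothetical; only the 10-second timeout
comes from the description above.

```python
# Simplified model of the cascade; NOT Nagios's real check logic.
CHECK_TIMEOUT = 10.0  # seconds; the check timeout described above

# Hypothetical topology: each host maps to its parent (None = no parent).
PARENTS = {"remote-server": "remote-router", "remote-router": None}


def check_ok(duration, timeout=CHECK_TIMEOUT):
    """A check passes only if it completes within the timeout."""
    return duration <= timeout


def topmost_failed(host, durations, parents=PARENTS):
    """Walk up the parent chain while checks keep timing out.

    Returns the highest host whose check failed, or None if the
    first check passed.
    """
    failed = None
    current = host
    while current is not None and not check_ok(durations[current]):
        failed = current
        current = parents[current]
    return failed


# Congestion slows every check across the circuit, so the router is
# blamed even though nothing is actually down.
congested = {"remote-server": 12.0, "remote-router": 11.5}
print(topmost_failed("remote-server", congested))  # prints remote-router
```

In this model a slow circuit and a dead router look identical, which
is exactly why the resulting tickets read as false positives.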

When we get a Nagios alert, it goes into our trouble ticket system and
is distributed to the appropriate administrator.  When the ticket is
issued for latency, however, it a) is viewed as a "false positive" and
b) draws attention away from real remote site outages.

With Nagios 3 I would repeat the host check five minutes later before
sending an alert.  That's not an option in Nagios 2.9.  I'm not
entirely comfortable running beta code in this production environment,
for political reasons rather than technical ones.

I'd like to separate the latency problem from a site-down problem.  I
can think of a couple of ways to do this:

1) increase the 10-second maximum timeout for a service check to
complete.  Can this be done in Nagios?

2) have the trouble ticket system be an escalation contact that is only
notified after the problem persists for five minutes.  We're not using
escalations today, but they can't be too hard.
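
The behavior wanted from option 2 can be sketched in a few lines of
Python.  This is only an illustration of the "persist for five
minutes" rule, not Nagios configuration or Nagios code; the
DelayedNotifier class and its observe() method are hypothetical names
made up for the example.

```python
from dataclasses import dataclass
from typing import Optional

HOLD_SECONDS = 5 * 60  # the five-minute delay from option 2


@dataclass
class DelayedNotifier:
    """Open a ticket only after a problem has persisted long enough."""

    hold_seconds: int = HOLD_SECONDS
    down_since: Optional[float] = None  # when the current problem began
    ticket_open: bool = False           # already ticketed this problem?

    def observe(self, now: float, is_up: bool) -> bool:
        """Record one check result; return True if a ticket is due now."""
        if is_up:
            # Recovery clears the timer, so short blips never ticket.
            self.down_since = None
            self.ticket_open = False
            return False
        if self.down_since is None:
            self.down_since = now
        if not self.ticket_open and now - self.down_since >= self.hold_seconds:
            self.ticket_open = True
            return True
        return False


notifier = DelayedNotifier()
notifier.observe(0, is_up=False)    # problem starts, no ticket yet
notifier.observe(120, is_up=True)   # two-minute blip recovers, timer resets
notifier.observe(200, is_up=False)  # a new problem starts
print(notifier.observe(500, is_up=False))  # prints True (down 300 seconds)
```

With this rule the two-minute latency blips never reach the ticket
system, while an outage that lasts five minutes still produces exactly
one ticket.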

Has anyone dealt with this type of problem before?  Any other
suggestions or advice on monitoring and alarming in this sort of
environment?

Thanks,
==ml

-- 
Michael W. Lucas 	mwlucas at BlackHelicopters.org, mwlucas at FreeBSD.org
		http://www.BlackHelicopters.org/~mwlucas/
      Coming Soon: "Absolute FreeBSD" -- http://www.AbsoluteFreeBSD.com
On 5/4/2007, the TSA kept 3 pairs of my soiled undies "for security reasons."

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null