Hosts report 'DOWN, HARD' after first attempt.

Patrick Morris patrick.morris at hp.com
Fri Jan 16 19:40:03 CET 2009


On Fri, 16 Jan 2009, Jonathan Call wrote:

> I am running a distributed monitoring system using Nagios 2.11 on
> FreeBSD 6.3. I use NSCA to send host and services events to the central
> server from the slave servers and have always had the following problem:
> 
> A distributed server notices a host service is "non-Ok" and fires off
> check-host-alive. I have it set up to do check_ICMP and so it fires off
> five ICMP packets. Since the network isn't always perfect those five
> packets get dropped. However, I have my max_retry_interval set to 3 so
> it fires off another check_ICMP which completes just fine. As a result I
> see the following events take place on the slave server:
> 
> [01-16-2009 15:18:46] HOST ALERT: s3200.blah.net;UP;SOFT;2;OK -
> 10.XX.XX.XX: rta 100.294ms, lost 0%
> [01-16-2009 15:18:46] HOST ALERT: s3200.blah.net;DOWN;SOFT;1;CRITICAL -
> 10.XX.XX.XX: rta nan, lost 100%
> 
> However on the central server I see the following:
> 
> [01-16-2009 15:19:02] HOST NOTIFICATION:
> NOC-email;s3200.blah.net;UP;host-notify-by-email;OK - 10.XX.XX.XX: rta
> 100.294ms, lost 0%
> [01-16-2009 15:19:01] HOST ALERT: s3200.blah.net;UP;HARD;1;OK -
> 10.XX.XX.XX: rta 100.294ms, lost 0%
> [01-16-2009 15:19:01] HOST NOTIFICATION:
> NOC-email;s3200.blah.net;DOWN;host-notify-by-email;CRITICAL -
> 10.XX.XX.XX: rta nan, lost 100%
> [01-16-2009 15:19:01] HOST ALERT: s3200.blah.net;DOWN;HARD;1;CRITICAL -
> 10.XX.XX.XX: rta nan, lost 100%
> 
> The central server is immediately flagging the host as DOWN, HARD in
> spite of having the same max_retry_interval = 3 setting. On some hosts
> this is generating a ton of false "HOST DOWN" notifications. Is there
> any way to fix it?

max_check_attempts applies only to active checks, not to the passive
results you're sending to the central server (I assume that when you
wrote max_retry_interval you meant max_check_attempts). Note also that
SOFT and HARD are relative to the server doing the checking; the state
type probably isn't passed along as part of the passive check
submission, which would explain why the central server logs each result
as HARD on attempt 1. In short, passive host checks are a bit of a pain.

I'm not sure exactly how you're passing check results to the central
server, but you may want to consider modifying the process to only send
host check results when they are in a hard state.
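
As a rough sketch of that idea, assuming results are forwarded by an
ochp_command on the slave that pipes to send_nsca: the script below
drops SOFT results and forwards only HARD ones. The script itself, the
paths, and the central server name are hypothetical placeholders; it
expects to be called with $HOSTNAME$, $HOSTSTATE$, $HOSTSTATETYPE$ and
$HOSTOUTPUT$ as its four arguments.

```python
#!/usr/bin/env python3
"""Hypothetical OCHP filter: forward a host result only when it is HARD."""
import subprocess
import sys

# Return codes Nagios expects for passive host check results.
STATE_CODES = {"UP": 0, "DOWN": 1, "UNREACHABLE": 2}

# Assumed locations and names; adjust for the local install.
SEND_NSCA = "/usr/local/bin/send_nsca"
SEND_NSCA_CFG = "/usr/local/etc/nagios/send_nsca.cfg"
CENTRAL_SERVER = "central.example.net"


def build_nsca_line(host, state, state_type, output):
    """Return the tab-separated send_nsca line, or None for SOFT states."""
    if state_type != "HARD":
        return None
    # Tabs and newlines in plugin output would break the field layout.
    clean = output.replace("\t", " ").replace("\n", " ")
    return "%s\t%d\t%s\n" % (host, STATE_CODES[state], clean)


def main(argv):
    if len(argv) != 5:
        sys.stderr.write("usage: %s HOST STATE STATETYPE OUTPUT\n" % argv[0])
        return 2
    line = build_nsca_line(*argv[1:5])
    if line is None:
        return 0  # SOFT state: send nothing to the central server
    result = subprocess.run(
        [SEND_NSCA, "-H", CENTRAL_SERVER, "-c", SEND_NSCA_CFG],
        input=line.encode(),
        check=False,
    )
    return result.returncode


# Only act when invoked with arguments, as Nagios would invoke it.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

With a filter like this, the DOWN;SOFT;1 and UP;SOFT;2 pair from your
slave log would never reach the central server, while a host that stays
down through all of its retries on the slave would still be reported.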

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users