Host check running before service check retry interval

Tedman Eng teng at dataway.com
Wed Sep 22 20:32:04 CEST 2004

We have the same problem.  On some of our monitored hosts, the internet
connection is very poor.  We don't care if those hosts are down for a
little while, but we do care if they are down for more than 1 hour.  The
only ways I've found so far are the methods you've described.
We're using escalations now, but it feels a bit kludgy.


-----Original Message-----
From: Robert Nelson [mailto:rnelson at windchannel.com]
Sent: Wednesday, September 22, 2004 3:17 AM
To: Nagios-users at lists.sourceforge.net
Subject: [Nagios-users] Host check running before service check retry
interval


Hello,

I'm having a problem with a few hosts on our network. We're a WISP, and
there are a few clients who create their own problems, like the construction
site that parks a crane in front of their radio for 10-15 minutes at a time
while loading materials. However, if the radio stays down for more than 30
minutes, we care about it (Funny story, a very special crane operator lifted
some steel beams up and caught them under the edge of the trailer, almost
flipping it. He then proceeded to snag the steel beams on our cat5 cable
going to the radio...).

I set the service checks for this one host to have a max_check_attempts of 3
and a retry_check_interval of 10, which should give me 30 minutes. This never
seems to happen, though. As soon as a service check fails once, a host check
is run. That fails too, the host goes into a hard down state, and we're back
to being notified immediately.
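
A quick sketch of the retry arithmetic, for reference. It assumes the first
failed check counts as attempt 1, so only (max attempts - 1) retries happen
after the initial failure, and that the interval is counted in the default
60-second time units. The function name is made up for illustration. Under
those assumptions, 3 attempts at an interval of 10 reach a hard state about 20
minutes after the first failure, and 4 attempts would be needed for 30.

```python
def minutes_until_hard_state(max_check_attempts, retry_interval, interval_length=60):
    """Minutes from the first failed check to the check that makes the state HARD.

    Assumption: the first non-OK result is attempt 1, so the number of
    retries is max_check_attempts - 1. interval_length is the number of
    seconds per interval unit (60 by default).
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    retries = max_check_attempts - 1
    return retries * retry_interval * interval_length / 60


if __name__ == "__main__":
    for attempts in (1, 3, 4):
        print(attempts, minutes_until_hard_state(attempts, 10))
```

None of this helps while the host check short-circuits the retries, but it is
worth knowing when picking the numbers.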

"When a service check results in a non-OK state, Nagios will check the host
that the service is associated with to determine whether or not it is up (see
the note below for info on how this is done). If the host is not up (i.e. it
is either down or unreachable), Nagios will immediately put the service into
a hard non-OK state and it will reset the current attempt number to 1."

If I read the above correctly, that's why this is happening! Is there a
suggested way to get around this and have an effective 30 minute non-OK
interval before ANY notifications?


Two ways I can think of:

1) Use check_dummy for the host check. Downside to this is that host
reporting will be broken and we'll be relying exclusively on the service
check for reporting. This will also break the parent-child relationship
built up for host UNREACHABLE notifications.

2) Find some check_whatever plugin that returns a host state based on the
last HARD state of the services, i.e. if there's at least one service that is
HARD OK/WARNING or in a SOFT change, return OK. If all the services are in a
HARD CRITICAL/UNREACHABLE state, return DOWN. Seems like a useful check
plugin to me but I haven't found it.
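
The decision rule in option 2 could be sketched roughly like this. This is
only the logic, not a working plugin: how the service states get read out of
Nagios is left open, and the function name and the (state, state_type) tuples
are made up for illustration. Services have no UNREACHABLE state, so the
sketch treats UNKNOWN as the other non-OK case, which is an assumption. It
also assumes exit code 0 is read as UP and 2 as DOWN for a host check.

```python
# Standard plugin return codes for service states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Exit codes the host check would return (assumed: 0 = UP, 2 = DOWN).
HOST_UP, HOST_DOWN = 0, 2


def host_state_from_services(services):
    """Derive a host check result from the states of the host's services.

    services is a list of (state, state_type) tuples, where state is one of
    the codes above and state_type is "HARD" or "SOFT" (hypothetical input
    format). Returns HOST_UP or HOST_DOWN.
    """
    services = list(services)
    if not services:
        # Assumption: with nothing to judge by, do not report the host down.
        return HOST_UP
    for state, state_type in services:
        if state_type == "SOFT":
            # Still retrying, so give the service time to reach a hard state.
            return HOST_UP
        if state in (OK, WARNING):
            # At least one service is answering, so the host is reachable.
            return HOST_UP
    # Every service is in a HARD CRITICAL or UNKNOWN state.
    return HOST_DOWN


if __name__ == "__main__":
    print(host_state_from_services([(CRITICAL, "SOFT"), (CRITICAL, "HARD")]))
    print(host_state_from_services([(CRITICAL, "HARD"), (UNKNOWN, "HARD")]))
```

A real plugin would end with sys.exit() on the returned code. The point of the
rule is that the host only goes DOWN once the service retries have run out.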

Am I going about this the wrong way? I could also do escalations, but in the
example I gave, I'd have to break that radio out of the radios hostgroup to
eliminate early notifications, which would just plain break the usefulness
of hostgroups.

Rob Nelson


-------------------------------------------------------
This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170
Project Admins to receive an Apple iPod Mini FREE for your judgement on
who ports your project to Linux PPC the best. Sponsored by IBM.
Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting
any issue. 
::: Messages without supporting info will risk being sent to /dev/null


