Host checks under Nagios 1.x

Aaron Devey adevey at omniture.com
Tue Apr 22 02:30:48 CEST 2008


I had a similar problem: I only wanted to be notified if a
not-so-important device had been down for an hour or more.

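Here's what I ended up doing:
I effectively disabled the host check by having it call an "always-ok"
check command that always returns 0.  I then added a 'PING' service to
the host with a max_check_attempts of 7 and a retry_check_interval of
10 minutes.  That way the service only reaches a hard critical state
after six retries, about an hour after the first failed check.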

The pitfall is that I no longer receive 'HOST DOWN' alerts for that
host.  Instead, I receive alerts for a failing 'PING' service.
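
For reference, here is a rough sketch of what that looks like in the
object config.  This is from memory rather than a copy of my real
files: the host name, address, ping thresholds and the
normal_check_interval are made up for illustration, I'm assuming the
stock check_dummy plugin and the check_ping command from the sample
config, and I've left out the other required directives (alias,
notification settings, contact groups, check_period and so on).

```
# Host check command that always reports OK (exit code 0).
# Assumes the check_dummy plugin is installed under $USER1$.
define command{
        command_name    always-ok
        command_line    $USER1$/check_dummy 0
        }

# Hypothetical host; the host check can never report DOWN.
define host{
        host_name               lab-printer
        address                 192.0.2.10
        check_command           always-ok
        max_check_attempts      1
        }

# The PING service does the real work and carries the one-hour delay.
define service{
        host_name               lab-printer
        service_description     PING
        check_command           check_ping!100.0,20%!500.0,60%
        normal_check_interval   5       ; assumed value
        retry_check_interval    10      ; minutes, with the default interval_length of 60
        max_check_attempts      7       ; hard CRITICAL on the 7th consecutive failure
        }
```

With those numbers the first failure is attempt 1, and attempts 2
through 7 follow at 10 minute intervals, so the notification goes out
roughly 60 minutes after the first failed check.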

-Aaron


Andrew Cruse wrote:
> I've got an interesting problem with a particular setup.  I'm monitoring a
> number of servers that the main Nagios installation doesn't have direct
> network access to, so I pass all of the host and service checks through an
> NRPE installation that can communicate with both Nagios and the servers
> being monitored.  A little tweaking with check timeouts and whatnot and this
> setup works pretty nicely.  I've run into a problem where for some reason,
> the NRPE server periodically stops responding to NRPE requests.  Haven't
> gotten to the bottom of that (Connection refused) yet.  Service checks are
> able to handle the problem fine as the duration of the NRPE outage is much
> shorter than the time it takes for the services to go into a hard critical
> state.  The problem is, once the first service check goes through and goes
> into a soft critical state, that triggers the host checks which also fail
> (host checks go through NRPE as well) and immediately generate a
> notification.  I'd like to find a way to make the host checks a little more
> forgiving as well.
> 
> A few things I've thought of or tried:
> 
> 1.  I tried bumping up the host check retries to 30, but since the checks
> immediately fail with "connection refused" it runs through all 30 tries
> within just a few seconds.  I also worry about this leading to unneeded load
> on the Nagios server since this is generally going to cause check_nrpe to be
> run 30 times, for each of the ~20 servers in this setup.
> 
> 2.  Extending the timeout on the check_nrpe commands doesn't help because
> "connection refused" is returned immediately.
> 
> 3.  Switching to a passive setup is probably the way to go, but for now I am
> trying to avoid all the reconfiguration needed to move in that direction.
> 
> 
> Ideally what I'd like to be able to do is have the host checks retry on a
> particular interval (e.g. once per second) rather than instantly after the
> previous one has executed.  Is there a way to do this?
> 
> Incidentally, while typing up this email I was actually able to find the
> root problem with the NRPE setup.  NRPE was being called via Xinetd which
> wasn't configured to allow enough simultaneous connections for a single
> service.  Thus when it started getting hammered with NRPE requests as a
> result of the host check configuration it would stop allowing NRPE
> connections for 30 seconds.  A quick change to the Xinetd config file seems
> to have solved the problem.
> 
> I'm still interested to know how anyone handles the situation where a host
> may be unresponsive to host checks for a period of time yet you only wish to
> fire off a notification after a specific period of time.  Would a wrapper
> around the host check be the only way to handle it?
> 
> Andrew
> 
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
> Don't miss this year's exciting event. There's still time to save $100.
> Use priority code J8TL2D2.
> http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
> 

