Host checks under Nagios 1.x

Andrew Cruse andrew at profitability.net
Mon Apr 21 20:40:16 CEST 2008


I've got an interesting problem with a particular setup.  I'm monitoring a
number of servers that the main Nagios installation doesn't have direct
network access to, so I pass all of the host and service checks through an
NRPE installation that can communicate with both Nagios and the servers
being monitored.  A little tweaking with check timeouts and whatnot and this
setup works pretty nicely.  However, I've run into a problem: for some reason
the NRPE server periodically stops responding to NRPE requests, and the checks
fail with "Connection refused".  I haven't gotten to the bottom of that yet.  Service checks are
able to handle the problem fine as the duration of the NRPE outage is much
shorter than the time it takes for the services to go into a hard critical
state.  The problem is that once the first service check fails and goes
into a soft critical state, it triggers a host check, which also fails
(host checks go through NRPE as well) and immediately generates a
notification.  I'd like to find a way to make the host checks a little more
forgiving as well.

A few things I've thought of or tried:

1.  I tried bumping up the host check retries to 30, but since the checks
immediately fail with "connection refused" it runs through all 30 tries
within just a few seconds.  I also worry about this leading to unneeded load
on the Nagios server since this is generally going to cause check_nrpe to be
run 30 times, for each of the ~20 servers in this setup.

2.  Extending the timeout on the check_nrpe commands doesn't help because
"connection refused" is returned immediately.

3.  Switching to a passive setup is probably the way to go, but for now I'm
trying to avoid all the reconfiguration needed to move in that direction.


Ideally what I'd like to be able to do is have the host checks retry at a
particular interval (e.g. once per second) rather than instantly after the
previous one has executed.  Is there a way to do this?

Incidentally, while typing up this email I was actually able to find the
root problem with the NRPE setup.  NRPE was being called via Xinetd which
wasn't configured to allow enough simultaneous connections for a single
service.  Thus when it started getting hammered with NRPE requests as a
result of the host check configuration it would stop allowing NRPE
connections for 30 seconds.  A quick change to the Xinetd config file seems
to have solved the problem.
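
For reference, the xinetd attributes that bound connections to a service are
"instances", "per_source" and "cps"; the second argument to "cps" is the number
of seconds the service stays disabled once the rate is exceeded, which would be
consistent with the 30-second lockout described above.  The fragment below is
only a sketch: the values, paths and user are assumptions for illustration, not
the configuration actually used in this setup.

```
# /etc/xinetd.d/nrpe (illustrative values and paths, adjust to your install)
service nrpe
{
    socket_type = stream
    port        = 5666
    wait        = no
    user        = nagios
    server      = /usr/local/nagios/bin/nrpe
    server_args = -c /usr/local/nagios/etc/nrpe.cfg --inetd
    # Maximum number of simultaneously running nrpe instances.
    instances   = 100
    # Maximum simultaneous connections from one source address.
    per_source  = 50
    # Allow up to 100 connections per second; if exceeded, disable
    # the service for 10 seconds before accepting connections again.
    cps         = 100 10
    disable     = no
}
```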

I'm still interested to know how anyone handles the situation where a host
may be unresponsive to host checks for a period of time yet you only wish to
fire off a notification after a specific period of time.  Would a wrapper
around the host check be the only way to handle it?
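
A minimal sketch of what such a wrapper might look like, assuming Python 3 is
available on the Nagios server.  The script name (retry_check.py), the function
name and the argument order are hypothetical; it simply reruns whatever plugin
command it is given, sleeping between failed attempts, and passes through the
plugin's exit code and output.

```python
#!/usr/bin/env python3
"""Retry a Nagios plugin with a pause between attempts (illustrative sketch)."""
import subprocess
import sys
import time


def run_with_retries(command, attempts=5, delay=1.0, sleep=time.sleep):
    """Run `command` up to `attempts` times, pausing `delay` seconds after a failure.

    Returns (exit_code, output) from the first attempt that exits 0,
    or from the last attempt if none succeed.
    """
    exit_code, output = 3, "UNKNOWN: check was never run"
    for attempt in range(1, attempts + 1):
        try:
            result = subprocess.run(
                command,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
            )
        except OSError as exc:
            # The plugin could not be started at all; retrying will not help.
            return 3, "UNKNOWN: could not run plugin: %s" % exc
        exit_code, output = result.returncode, result.stdout.strip()
        if exit_code == 0:
            break
        if attempt < attempts:
            # Pause only between attempts, not after the final one.
            sleep(delay)
    return exit_code, output


# Hypothetical usage: retry_check.py ATTEMPTS DELAY_SECONDS PLUGIN [PLUGIN ARGS...]
if __name__ == "__main__" and len(sys.argv) > 3:
    code, text = run_with_retries(sys.argv[3:], int(sys.argv[1]), float(sys.argv[2]))
    print(text)
    sys.exit(code)
```

The host check command would then call the wrapper in place of check_nrpe
directly, with the original check_nrpe command line as the trailing arguments.
Attempts times delay needs to stay comfortably below the configured host check
timeout, and the host's own retry count can drop back to a small number since
the wrapper does the waiting.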

Andrew

