Removing host checks for non-OK passive results

Bruce Campbell nagios-devel at vicious.dropbear.id.au
Fri May 19 20:06:13 CEST 2006


On Tue, 16 May 2006, Ethan Galstad wrote:

> Jason Martin wrote:
>> On Tue, May 16, 2006 at 04:03:10PM +0100, Ton Voon wrote:
>>> Hi Ethan!
>>> 
>>> We found that Nagios is making a host check for every non-OK passive 
>>> result received. We don't think that is necessary because if a passive 
>>> result is received, then the host must be okay! Details and 
>> That is not necessarily true. In distributed mode, one Nagios
>> might send a service check result to another Nagios via NSCA
>> hence passive check result.  The host being monitored may be
>> down but you'll still get passive check results for it and don't
>> want the host assumed up.

Or more precisely, the host may well be 'down' from one monitoring node's 
point of view, and 'up' from another monitoring node's point of view. 
IMHO, each monitoring node should maintain its own idea of a host's up/down 
state, and the nodes should not send or accept host check results between 
themselves.  Service check results are a different issue.

>> However, if this was implemented as a config file option such
>> that users could invoke this only if they are not running Nagios
>> in a distributed manner, it would make sense to include it.
>> Otherwise it probably breaks distributed Nagios behavior.

The main implication is that such behaviour hides issues with a particular 
monitoring node being unable to contact a given host, usually because 
someone has ham-fisted the relevant ACLs.  This is fine in normal 
circumstances, but when none of the monitoring nodes that can reach the 
problematic hosts are supplying results, you'll get an annoying set of 
false positives.

> Indeed, this would cause problems under a distributed environment.  I'll have 
> to think about this a bit and see whether or not (yet another) config file 
> option would be appropriate...  This will go on my list of outstanding 
> TODO/TOLOOKAT issues for 3.0.

This is probably a repeated 'TOLOOKAT' issue, but...

The problem at heart here is that Nagios wants to execute a host check 
each time a non-OK service result comes in (passive or active).  That is, 
if you have a host with 20 services, 15 of which have just failed due to 
some subtle inter-dependency that you didn't previously know about, Nagios 
will happily run 15 separate host checks for the same host.

Ideally, Nagios just runs one host check after the first non-OK service 
result comes in, and uses the cached value as long as it is within the 
host's freshness_threshold.  Otherwise, your check_latency for everything 
goes way up, and you eventually write your own scheduler out of irritation 
at seeing service checks being executed at 5 hour intervals.
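
To make the idea concrete, here is a rough sketch in Python of the caching 
behaviour described above.  It is not Nagios code: the class HostCheckCache, 
its methods, and the run_check callback are all hypothetical names invented 
for illustration, and treating freshness_threshold as a cache lifetime in 
seconds is an assumption of the sketch.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class HostCheckCache:
    """Hypothetical sketch: reuse a recent host check result instead of
    running a new host check for every non-OK service result."""

    def __init__(
        self,
        run_check: Callable[[str], str],
        freshness_threshold: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.run_check = run_check                      # performs the real host check
        self.freshness_threshold = freshness_threshold  # assumed to be seconds
        self.clock = clock                              # injectable for testing
        self._cache: Dict[str, Tuple[float, str]] = {}  # host -> (checked_at, state)
        self.checks_run = 0                             # counts real checks executed

    def host_state(self, host: str) -> str:
        """Return the cached state if it is still fresh, else re-check."""
        now = self.clock()
        cached = self._cache.get(host)
        if cached is not None and now - cached[0] <= self.freshness_threshold:
            return cached[1]
        state = self.run_check(host)
        self.checks_run += 1
        self._cache[host] = (now, state)
        return state

    def on_service_result(self, host: str, service_ok: bool) -> Optional[str]:
        """Only a non-OK service result (passive or active) asks for host state."""
        if service_ok:
            return None
        return self.host_state(host)
```

With a 60 second threshold, 15 non-OK service results for the same host 
arriving together cause one call to run_check rather than 15; the next real 
host check only happens once the cached result is older than the threshold.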

-- 
   Bruce Campbell

