Removing host checks for non-OK passive results

Ton Voon ton.voon at altinity.com
Tue May 23 16:34:29 CEST 2006


On 19 May 2006, at 19:06, Bruce Campbell wrote:

> On Tue, 16 May 2006, Ethan Galstad wrote:
>
>> Jason Martin wrote:
>>> On Tue, May 16, 2006 at 04:03:10PM +0100, Ton Voon wrote:
>>>> Hi Ethan!
>>>> We found that Nagios is making a host check for every non-OK  
>>>> passive result received. We don't think that is necessary  
>>>> because if a passive result is received, then the host must be  
>>>> okay! Details and
>>> That is not necessarily true. In distributed mode, one Nagios
>>> might send a service check result to another Nagios via NSCA
>>> hence passive check result.  The host being monitored may be
>>> down but you'll still get passive check results for it and don't
>>> want the host assumed up.
>
> Or more precisely, the host may well be 'down' from one monitoring  
> node's point of view, and 'up' from another monitoring node's point  
> of view. Imho, each monitoring node should maintain its own idea of  
> host's up/down state, and not send/accept host check results  
> between themselves. Service check results are a different issue.

We set up distributed monitoring across internationally spread
datacenters. Because of firewall policies, only the local monitoring
server in each datacenter can ping its local hosts. Thus the central
monitoring server really has no idea whether a node is up or down - it
has to rely on the slave monitoring server.
>
>> Indeed, this would cause problems under a distributed  
>> environment.  I'll have to think about this a bit and see whether  
>> or not (yet another) config file option would be appropriate...   
>> This will go on my list of outstanding TODO/TOLOOKAT issues for 3.0.
>
> This is probably a repeated 'TOLOOKAT' issue, but..
>
> The problem at heart here is that Nagios wants to execute a host  
> check each time a non-OK service result comes in (passive or  
> active).  That is, if you have a host with 20 services, 15 of which  
> have just failed due to some subtle inter-dependency that you  
> didn't previously know about, Nagios will happily run 15 separate  
> host checks for the same host.
>
> Ideally, Nagios just runs one host check after the first non-OK  
> service result comes in, and uses the cached value as long as it is  
> within the host's freshness_threshold.  Otherwise, your  
> check_latency for everything goes way up, and you eventually write  
> your own scheduler out of irritation at seeing service checks being  
> executed at 5 hour intervals.

Hmm, not sure about writing your own scheduler :)

We considered using a "cache" value for a host status - I think the  
idea has merit and would reduce a large number of host checks,  
especially if something suddenly happened to a large set of services  
on one host. However, we baulked at going ahead because there's bound  
to be some subtle situation where this would be undesirable.
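
To make the cache idea concrete, here is a minimal sketch in Python. It
is not Nagios code: the names HostCheckCache, run_check and
max_age_seconds are hypothetical, and max_age_seconds only stands in for
whatever freshness window would really be used. It shows one host check
being run for the first non-OK service result, with later results
reusing the cached state until it expires.

```python
import time


class HostCheckCache:
    """Hypothetical cache of host check results, keyed by host name."""

    def __init__(self, run_check, max_age_seconds, clock=time.monotonic):
        self._run_check = run_check      # callable(host) -> state, e.g. "UP"
        self._max_age = max_age_seconds  # assumed freshness window
        self._clock = clock              # injectable so it can be tested
        self._cache = {}                 # host -> (state, time of check)

    def host_state(self, host):
        """Return a cached state if fresh enough, else run a real check."""
        now = self._clock()
        cached = self._cache.get(host)
        if cached is not None and now - cached[1] <= self._max_age:
            return cached[0]
        state = self._run_check(host)
        self._cache[host] = (state, now)
        return state

    def on_service_result(self, host, service_state):
        """Handle one service result (active or passive).

        OK results need no host check; non-OK results ask for the host
        state, which is served from the cache when possible.
        """
        if service_state == "OK":
            return None
        return self.host_state(host)
```

With this shape, 15 non-OK results arriving for one host inside the
window would trigger a single host check rather than 15. The subtle
cases still need thought, for example a host that goes down just after
its state was cached.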

If the idea is validated through this thread (seems like the best way  
to test a design!), then we may be able to subsidise the development  
of it at Altinity.

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon



