Removing host checks for non-OK passive results

Ton Voon ton.voon at altinity.com
Wed May 24 14:00:31 CEST 2006


On 24 May 2006, at 10:59, Joerg Linge wrote:

> On Wednesday 24 May 2006 at 11:37, Bruce Campbell wrote:
>>> We considered using a "cache" value for a host status - I think the idea
>>> has merit and would reduce a large number of host checks, especially if
>>> something suddenly happened to a large set of services on one host.
>>> However, we baulked at going ahead because there's bound to be some
>>> subtle situation where this would be undesirable.
>>
>> See the "Workaround for 'Host DOWN' false-positives" thread for another
>> way of doing it (slurp in the entire status.dat file if you've got a small
>> installation, submit passive host check results from a service check if
>> you've got a large installation).  Both have the advantage of being driven
>> by Nagios.
>
> There is another thread, "'Host DOWN' false-positives", on nagios-users.
> What do you think about that solution?

I've just reviewed that thread. Please correct me if my summary below is
wrong.

PROBLEM

Intermittent connectivity failures across a WAN can cause an outage of
1 minute. The host check run by Nagios has a max_check_attempts of 10,
but since the host check attempts are run immediately, without a retry
interval, the host will go into a HARD failure state before the WAN
recovers.

SUGGESTED SOLUTIONS

The basic premise is that the host status is a reflection of a  
suitable service status. There are 3 techniques:
   1. Use a dependent service. If this service fails, the host check
is run; the host check looks up the result of the dependent service
in status.dat and uses it as the actual host status
   2. Use check_cluster for a similar trick
   3. Get a service to submit a host check result
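
As a rough illustration of technique 3, a service check could hand Nagios a
passive host result through the external command file using the
PROCESS_HOST_CHECK_RESULT command. This is only a sketch: the helper names,
the host name and the command file path are made up for the example, and the
thread itself does not say how the submission was implemented.

```python
import time

# Host state codes accepted by PROCESS_HOST_CHECK_RESULT.
HOST_UP, HOST_DOWN, HOST_UNREACHABLE = 0, 1, 2


def format_host_result(host_name, state, output, now=None):
    """Build one external-command line carrying a passive host check result."""
    if state not in (HOST_UP, HOST_DOWN, HOST_UNREACHABLE):
        raise ValueError("unknown host state: %r" % (state,))
    timestamp = int(time.time() if now is None else now)
    # A newline would end the command early, so flatten the plugin output.
    safe_output = output.replace("\n", " ")
    return "[%d] PROCESS_HOST_CHECK_RESULT;%s;%d;%s\n" % (
        timestamp, host_name, state, safe_output)


def submit_host_result(command_file, host_name, state, output):
    """Write the command to the Nagios command file (a named pipe).

    The path is site-specific (hypothetical here); external commands
    must be enabled in Nagios for the result to be processed.
    """
    with open(command_file, "w") as pipe:
        pipe.write(format_host_result(host_name, state, output))
```

A service check that has just succeeded might then call something like
submit_host_result("/path/to/nagios.cmd", "wanhost", HOST_UP, "derived from
service result"), where both the path and the host name are placeholders.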

This doesn't seem to be the same thing that this thread is about
(reducing the number of host check invocations triggered by non-OK
statuses from active or passive checks).

While the solutions above do the job of updating the host status, you
lose the "specialness" of host checks (invoked on demand,
reachability logic, etc.).

Going back to the original problem, would a retry_check_interval for
host checks help with this particular case? I'm not sure how this
would affect Nagios' scheduling, because host checks are serialised
(although Ethan says this will be changed in
http://nagios.org/development/upcoming.php), but it would spread out
the retries, so the HARD state would not be reached unless the outage
lasted over a longer period.
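
To make the arithmetic concrete, here is a small model of when the retries
land relative to the outage. It is not how Nagios schedules checks
internally; it assumes the first failed check happens at the start of the
outage and that attempts are evenly spaced. The 2-second and 30-second
spacings are invented for illustration.

```python
def reaches_hard_state(outage_seconds, max_check_attempts, seconds_between_attempts):
    """Return True if every check attempt falls inside the outage window.

    Attempt k (0-based) is modelled as running at
    k * seconds_between_attempts after the outage begins. If the last
    attempt still lands inside the outage, all attempts fail and the
    host goes into a HARD state.
    """
    last_attempt_time = (max_check_attempts - 1) * seconds_between_attempts
    return last_attempt_time < outage_seconds


# Back-to-back attempts, assumed to take about 2 seconds each:
# the 10th attempt runs at 18 s, well inside a 60 s outage.
immediate = reaches_hard_state(60, 10, 2)

# Attempts spread 30 seconds apart: the 10th attempt runs at 270 s,
# long after a 60 s outage has ended.
spread = reaches_hard_state(60, 10, 30)
```

In this model the immediate retries all fail within the one-minute outage,
while the spread-out retries outlast it, which is the effect a retry
interval for host checks would be expected to have.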

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon
