Host retry interval

Eli Stair estair at ilm.com
Mon May 22 21:49:49 CEST 2006


Host checks are initiated once the service reaches the HARD state, which
comes on the last allowed check, as defined by max_check_attempts for that
service.  So if you set the service's max_check_attempts to 15, then after
roughly 15 minutes (assuming a 60-second retry interval) the service will hit
a HARD CRITICAL, and host checks will fire.
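
To make the arithmetic concrete, here is a minimal sketch of the time between the first failed check and the HARD state. The function name and parameters are hypothetical; it assumes the first failure counts as attempt 1 (as in the SOFT;1 log entry quoted below), that every retry waits exactly one retry interval, and that interval_length is 60 seconds. Real scheduling adds some latency, so treat the result as approximate.

```python
def seconds_until_hard(max_check_attempts, retry_interval_units=1,
                       interval_length=60):
    """Approximate seconds from the first failed check to the HARD state.

    Assumption: attempt 1 is the first failed check, so only
    (max_check_attempts - 1) retries have to elapse before the final,
    HARD result.
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    retries = max_check_attempts - 1
    return retries * retry_interval_units * interval_length


if __name__ == "__main__":
    # 15 attempts with a 60-second retry interval, as in the example above.
    delay = seconds_until_hard(15, retry_interval_units=1, interval_length=60)
    print(f"{delay} seconds (~{delay / 60:.0f} minutes)")
```

Under these assumptions, 15 attempts at 60 seconds works out to 14 retry intervals after the first failure, which is in line with the "roughly 15 minutes" figure once the time before the first failed check is included.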

/eli


On 5/22/06 12:45 PM, "Kyle Tucker" <kylet at panix.com> wrote:

> Thanks Eli,
> 
> According to the docs, "Nagios checks the status of a host when a service
> check results in a non-OK status." Is that when it reaches a HARD state
> after all iterations of max_check_attempts are done or as soon as it goes to
> non-OK (SOFT) state? If the latter, which seems to be when it's kicking off
> host checks, I don't see how increasing the service checks will help.
> 
>> I'd suggest NOT taking this action at the host level; reason being that all
>> service checks are halted for the duration of this non-parallelized
>> action... You want to avoid doing a host check until absolutely necessary.
>> Suggest increasing the max_check_attempts on the SERVICE to a larger number,
>> this will avoid impacting the monitoring system as a whole.
>> 
>> /eli
>> 
>> 
>> On 5/22/06 11:45 AM, "Kyle Tucker" <kylet at panix.com> wrote:
>> 
>>> Hi,
>>> I have many hosts that are constantly giving me DOWN/UP state as they
>>> are unreachable for certain periods. In an attempt to give the system more
>>> time to become available, I increased the max_check_attempts from 2 to 5. At
>>> 2 the interval between retry attempts was 10 seconds. Now at 5, the interval
>>> is 7 seconds. I'd like to have this interval higher for some hosts, but
>>> there's a real scary note on the hosts check_interval option to not use it
>>> if you can help it. Is this my only option or is there a better way? I am
>>> also following the thread titled "Workaround for 'Host DOWN'
>>> false-positives", but I don't clearly see how to set that up. Here's some
>>> output for a failed host (I tail and parse the date field in a script so
>>> it's readable).
>>> 
>>> Mon May 22 14:25:52 2006  HOST ALERT: badhost;DOWN;SOFT;1;* system down * - snmpd not responding
>>> Mon May 22 14:25:59 2006  HOST ALERT: badhost;DOWN;SOFT;2;* system down * - snmpd not responding
>>> Mon May 22 14:26:06 2006  HOST ALERT: badhost;DOWN;SOFT;3;* system down * - snmpd not responding
>>> Mon May 22 14:26:16 2006  HOST ALERT: badhost;DOWN;SOFT;4;* system down * - snmpd not responding
>>> Mon May 22 14:26:23 2006  HOST ALERT: badhost;DOWN;HARD;5;* system down * - snmpd not responding
>>> 
>> 
> 



_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users