Host check retries - Nagios bug?

Emanuel Massano emanuel.massano at fccn.pt
Tue Jun 17 12:47:48 CEST 2008


Hi all,

I sent a related email to nagios-users last week but got no
answer, and since I believe I've found a bug, I'm posting to devel this
time. Sorry if this is not the right place (I believe it is) for reporting
this problem.
The related email is attached, and I've been doing some tests
since sending it.

First of all, let me show part of my log file:

[1213641058] SERVICE ALERT: test;ping;CRITICAL;SOFT;1;CRITICAL -
x.x.x.x: rta nan, lost 100%
[1213641077] Warning: The check of host 'test' looks like it was
orphaned (results never came back).  I'm scheduling an immediate check
of the host...
[1213641098] HOST ALERT: test;DOWN;SOFT;1;CRITICAL - x.x.x.x: rta nan,
lost 100%
[1213641098] GLOBAL HOST EVENT HANDLER:
test;(null);(null);(null);sd_host_incident
[1213641118] SERVICE ALERT: test;ping;CRITICAL;HARD;1;CRITICAL -
x.x.x.x: rta nan, lost 100%
[1213641118] HOST ALERT: test;DOWN;HARD;1;CRITICAL - x.x.x.x: rta nan,
lost 100%
[1213641118] HOST NOTIFICATION:
emanuel;test;DOWN;host-notify-by-email;CRITICAL - x.x.x.x: rta nan, lost
100%
[1213641118] HOST NOTIFICATION:
helpdesk;test;DOWN;fccn_HostNotify;CRITICAL - x.x.x.x: rta nan, lost 100%
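
To make the timing easier to read, here is a small sketch (plain Python, not part of Nagios) that computes the gaps between the log entries above. The timestamps are copied from the log; the short labels and the helper name `offsets` are only illustrative.

```python
# Epoch timestamps copied from the log excerpt above; labels are shorthand.
events = [
    (1213641058, "SERVICE ALERT ping CRITICAL SOFT 1"),
    (1213641077, "Warning: host check looks orphaned"),
    (1213641098, "HOST ALERT DOWN SOFT 1"),
    (1213641118, "SERVICE ALERT ping CRITICAL HARD 1"),
    (1213641118, "HOST ALERT DOWN HARD 1"),
]


def offsets(entries):
    """Return (seconds since first entry, seconds since previous entry, label)."""
    first = entries[0][0]
    prev = first
    rows = []
    for ts, label in entries:
        rows.append((ts - first, ts - prev, label))
        prev = ts
    return rows


for since_start, since_prev, label in offsets(events):
    print(f"+{since_start:3d}s (delta {since_prev:2d}s)  {label}")
```

The orphan warning appears 19s after the service alert, the SOFT host result at 40s, and the HARD host result at 60s.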


And then some considerations: I am not using regularly scheduled host
checks, my host checks always take 40s because that is the timeout set
on check_icmp, max_check_attempts is 10, and my general timeouts are:
service_check_timeout=90
host_check_timeout=120
event_handler_timeout=30
notification_timeout=30

The normal expected behavior is:
- when a host goes down, the first message is a service problem
- when a service problem is found, a host check is immediately executed
- the host check runs for 40s; if it has not finished, it is
stopped after 120s
- there will be 9 soft down states
- the 10th attempt will be a hard down state and notifications will be sent
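
The expected sequence above can be sketched as follows. This is only a simplified model in Python, not Nagios code: it assumes the host stays down, that each check takes the full 40s, and that checks run back to back with no retry interval. The function name `expected_states` is made up for illustration.

```python
MAX_CHECK_ATTEMPTS = 10   # max_check_attempts quoted above
CHECK_DURATION_S = 40     # check_icmp timeout quoted above


def expected_states(max_attempts, duration_s):
    """Yield (attempt, state type, earliest elapsed seconds) for a host that stays down.

    Simplification: checks run back to back; retry intervals are ignored.
    """
    for attempt in range(1, max_attempts + 1):
        state_type = "HARD" if attempt == max_attempts else "SOFT"
        yield attempt, state_type, attempt * duration_s


timeline = list(expected_states(MAX_CHECK_ATTEMPTS, CHECK_DURATION_S))
for attempt, state_type, elapsed in timeline:
    print(f"attempt {attempt:2d}: DOWN;{state_type} no earlier than {elapsed}s")
```

Under these assumptions the HARD state, and therefore the notification, should not arrive before 400s.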

The strange behavior I've found is:
- the nagios process waits only 20 seconds for the first host check,
not 120 as expected, as shown by the warning message. Then nagios executes
an immediate host check
- the first check result is received normally after 40s, but nagios is
already executing another check
- the second check takes the normal ~40s, but the host immediately
goes into a HARD state
- there aren't 10 attempts as expected from the max_check_attempts setting,
because of the strange behavior shown here.


So my question is, is this a bug, or a configuration problem?

Thank you very much for any help, since this problem is driving my
helpdesk team nuts :O because of the false alarms.


Best regards,
Emanuel Massano

-------------- next part --------------
An embedded message was scrubbed...
From: Emanuel Massano <emanuel.massano at fccn.pt>
Subject: Host check retries
Date: Thu, 12 Jun 2008 14:21:55 +0100
Size: 11429
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20080617/034f54f8/attachment.mht>
-------------- next part --------------
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel