Reproducible bug scheduling second host check following failure

Matthew Richardson matthew-ln at itconsult.co.uk
Sat Dec 29 19:11:37 CET 2007


I have, for some time, been suspicions that Nagios 3 appeared a bit slow
alerting on host failures.  In order to diagnose this, I setup a minimal
config to try to show the problem using 3.0rc1 with a default nagios.cfg.
It consists of one host and one service (both checked by ping), with the
retries and timers set as below (the config file is attached in full):-

|define host{
|       host_name                       hostname
|	check_interval                  10
|	retry_interval                  1
|	max_check_attempts              5
|}
|define service{
|	max_check_attempts              5
|	normal_check_interval           5
|	retry_check_interval            5
|       host_name                       hostname
|       service_description             ping
|}

The host was up at 16:25 with the next host check scheduled for 16:35 and
the next service check for 16:30:-

|[2007-12-29 16:25:42] [1198945542.102847] [016.0] [pid=3660] Scheduling a non-forced, active check of service 'ping' on host 'hostname' @ Sat Dec 29 16:30:32 2007
|[2007-12-29 16:25:52] [1198945552.201392] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=1/5, Type=HARD, Final State=0
|[2007-12-29 16:25:52] [1198945552.201763] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:35:52 2007

The host was then unplugged and the service was checked as expected at
16:30.  As the service was down, the host (again as expected) was checked.
The next host check was scheduled for 16:32 as expected with the host
retry_interval of 1 minute:-

|[2007-12-29 16:30:32] [1198945832.178688] [016.0] [pid=3660] Attempting to run scheduled check of service 'ping' on host 'hostname': check options=0, latency=0.178000
|[2007-12-29 16:30:52] [1198945852.109282] [016.1] [pid=3660] HOST: hostname, SERVICE: ping, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: CRITICAL - Host Unreachable (m18.itconsult.net)
|[2007-12-29 16:30:52] [1198945852.109557] [016.1] [pid=3660] Service is in a non-OK state
|[2007-12-29 16:30:52] [1198945852.109579] [016.1] [pid=3660] Host is currently UP, so we'll recheck its state to make sure...
|[2007-12-29 16:30:52] [1198945852.109626] [016.0] [pid=3660] ** Running async check of host 'hostname'...
|[2007-12-29 16:30:52] [1198945852.127798] [016.0] [pid=3660] Scheduling a non-forced, active check of service 'ping' on host 'hostname' @ Sat Dec 29 16:35:32 2007
|[2007-12-29 16:31:02] [1198945862.209380] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=1/5, Type=SOFT, Final State=1
|[2007-12-29 16:31:02] [1198945862.210709] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:32:02 2007
|[2007-12-29 16:31:02] [1198945862.210735] [016.2] [pid=3660] Found another host check event for this host...
|[2007-12-29 16:31:02] [1198945862.210760] [016.2] [pid=3660] Keeping original host check event (ignoring the new one).
|[2007-12-29 16:31:02] [1198945862.210831] [016.1] [pid=3660] ** Async check result for host 'hostname' handled: new state=1

However, although the CGIs continued to show the next host check for 16:32,
nothing happened until:-

|[2007-12-29 16:35:32] [1198946132.119030] [016.0] [pid=3660] Attempting to run scheduled check of service 'ping' on host 'hostname': check options=0, latency=0.118000
|[2007-12-29 16:35:42] [1198946142.223011] [016.0] [pid=3660] Scheduling a non-forced, active check of service 'ping' on host 'hostname' @ Sat Dec 29 16:40:32 2007
|[2007-12-29 16:35:52] [1198946152.053988] [016.0] [pid=3660] Attempting to run scheduled check of host 'hostname': check options=0, latency=0.053000
|[2007-12-29 16:35:52] [1198946152.054042] [016.0] [pid=3660] ** Running async check of host 'hostname'...
|[2007-12-29 16:36:02] [1198946162.152677] [016.2] [pid=3660] Host has no parents, so it is DOWN.
|[2007-12-29 16:36:02] [1198946162.152699] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=2/5, Type=SOFT, Final State=1
|[2007-12-29 16:36:02] [1198946162.154051] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:37:02 2007

Thereafter, the remaining host checks were executed at the correct
retry_interval of 1 minute:-

|[2007-12-29 16:37:02] [1198946222.148471] [016.0] [pid=3660] Attempting to run scheduled check of host 'hostname': check options=0, latency=0.148000
|[2007-12-29 16:37:12] [1198946232.008071] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=3/5, Type=SOFT, Final State=1
|[2007-12-29 16:37:12] [1198946232.009395] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:38:12 2007
|[2007-12-29 16:38:12] [1198946292.239641] [016.0] [pid=3660] Attempting to run scheduled check of host 'hostname': check options=0, latency=0.239000
|[2007-12-29 16:38:22] [1198946302.086426] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=4/5, Type=SOFT, Final State=1	
|[2007-12-29 16:38:22] [1198946302.087774] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:39:22 2007
|[2007-12-29 16:39:22] [1198946362.069056] [016.0] [pid=3660] Attempting to run scheduled check of host 'hostname': check options=0, latency=0.068000
|[2007-12-29 16:39:32] [1198946372.181915] [016.1] [pid=3660] Pre-handle_host_state() Host: hostname, Attempt=5/5, Type=HARD, Final State=1
|[2007-12-29 16:39:32] [1198946372.184009] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:49:32 2007


Having now prepared this bug report, I suspect that the problem relates
to:-

|[2007-12-29 16:31:02] [1198945862.210735] [016.2] [pid=3660] Found another host check event for this host...
|[2007-12-29 16:31:02] [1198945862.210760] [016.2] [pid=3660] Keeping original host check event (ignoring the new one).

which causes the previous host check:-

|[2007-12-29 16:25:52] [1198945552.201763] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:35:52 2007

incorrectly to take precedence over:-

|[2007-12-29 16:31:02] [1198945862.210709] [016.0] [pid=3660] Scheduling a non-forced, active check of host 'hostname' @ Sat Dec 29 16:32:02 2007

This is completely reproducible, although I did adjust some of the timers &
retries to make it easier to demonstrate.  In case it is relevant, I have
been testing on Centos 4.6.

Finally, just in case they are useful, the full debug logs are available
from: http://user.itconsult.co.uk/temp/itconsult1229a.tar.gz

Best wishes,
Matthew
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test1.cfg
Type: application/octet-stream
Size: 2885 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20071229/61d867b6/attachment.obj>
-------------- next part --------------
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
-------------- next part --------------
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel


More information about the Developers mailing list