hosts change from soft to hard state after first attempt

Watson, Shaun shaun.watson at verizon.com
Tue Jan 24 21:18:58 CET 2012


I am looking for some help tracking down what I think is potentially a timing bug or something of the sort.

I show entries in my logs which look like this:

[1327419369] HOST ALERT: testhost;DOWN;SOFT;1;CRITICAL - 10.11.12.13: rta nan, lost 100%
[1327419369] HOST ALERT: testhost;DOWN;HARD;1;CRITICAL - 10.11.12.13: rta nan, lost 100%

The host entry is configured as active on-demand (not scheduled) with a check attempts of 5.  I see the above quite frequently where within the same second, it changes state to DOWN/SOFT 1/5 and then to DOWN/HARD 1/5.

After performing some debugging, it appears that there are two service checks failing back-to-back and both scheduling a on-demand host check within fractions of a second.  Looking through the codebase, it looks like there are provisions in place to abort a check if there is one already executing, but based on the output of the debug, there are two on-demand host checks being executed simultaneously.

Any help would be appreciated.  System is running Nagios 3.3.1 on RedHat ES5 32-bit.

[1327419359.134875] [016.0] [pid=12950] Attempting to run scheduled check of service 'check_ssh' on host 'testhost': check options=0, latency=0.134000
[1327419359.134899] [016.0] [pid=12950] Checking service 'check_ssh' on host 'testhost'...
[1327419359.222060] [016.0] [pid=12950] Attempting to run scheduled check of service 'check_icmp_latency' on host 'testhost': check options=0, latency=0.222000
[1327419359.222085] [016.0] [pid=12950] Checking service 'check_icmp_latency' on host 'testhost'...
[1327419365.212247] [016.1] [pid=12950] Handling check result for service 'check_ssh' on host 'testhost'...
[1327419365.212254] [016.0] [pid=12950] ** Handling check result for service 'check_ssh' on host 'testhost'...
[1327419365.212260] [016.1] [pid=12950] HOST: testhost, SERVICE: check_ssh, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: CRITICAL - Timed out connecting to remote host
[1327419365.212290] [016.0] [pid=12950] ** Running async check of host 'testhost'...
[1327419365.212306] [016.0] [pid=12950] Checking host 'testhost'...
[1327419365.220665] [016.0] [pid=12950] Scheduling a non-forced, active check of service 'check_ssh' on host 'testhost' @ Tue Jan 24 15:38:29 2012
[1327419365.223831] [016.1] [pid=12950] Checking service 'check_ssh' on host 'testhost' for flapping...
[1327419365.223854] [016.1] [pid=12950] Checking host 'testhost' for flapping...
[1327419365.265025] [016.1] [pid=12950] Handling check result for service 'check_icmp_latency' on host 'testhost'...
[1327419365.265033] [016.0] [pid=12950] ** Handling check result for service 'check_icmp_latency' on host 'testhost'...
[1327419365.265040] [016.1] [pid=12950] HOST: testhost, SERVICE: check_icmp_latency, CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK: Yes, RETURN CODE: 2, OUTPUT: CRITICAL - 10.11.12.13: rta nan, lost 100%|rta=0.000ms;600.000;1000.000;0; pl=100%;33;66;; \n
[1327419365.265106] [016.0] [pid=12950] ** Running async check of host 'testhost'...
[1327419365.265364] [016.0] [pid=12950] Scheduling a non-forced, active check of service 'check_icmp_latency' on host 'testhost' @ Tue Jan 24 15:38:29 2012
[1327419365.266349] [016.1] [pid=12950] Checking service 'check_icmp_latency' on host 'testhost' for flapping...
[1327419365.266360] [016.1] [pid=12950] Checking host 'testhost' for flapping...
[1327419369.079254] [016.1] [pid=12950] Handling check result for host 'testhost'...
[1327419369.079261] [016.1] [pid=12950] ** Handling async check result for host 'testhost'...
[1327419369.080141] [016.1] [pid=12950] HOST: testhost, ATTEMPT=1/5, CHECK TYPE=ACTIVE, STATE TYPE=HARD, OLD STATE=0, NEW STATE=1
[1327419369.080175] [016.1] [pid=12950] Pre-handle_host_state() Host: testhost, Attempt=1/5, Type=SOFT, Final State=1
[1327419369.080290] [016.1] [pid=12950] Post-handle_host_state() Host: testhost, Attempt=1/5, Type=SOFT, Final State=1
[1327419369.080327] [016.1] [pid=12950] Checking host 'testhost' for flapping...
[1327419369.080368] [016.0] [pid=12950] Scheduling a non-forced, active check of host 'testhost' @ Tue Jan 24 15:37:09 2012
[1327419369.084229] [016.1] [pid=12950] ** Async check result for host 'testhost' handled: new state=1
[1327419369.084266] [016.1] [pid=12950] Handling check result for host 'testhost'...
[1327419369.084273] [016.1] [pid=12950] ** Handling async check result for host 'testhost'...
[1327419369.084348] [016.1] [pid=12950] HOST: testhost, ATTEMPT=1/5, CHECK TYPE=ACTIVE, STATE TYPE=SOFT, OLD STATE=1, NEW STATE=1
[1327419369.084394] [016.1] [pid=12950] Pre-handle_host_state() Host: testhost, Attempt=1/5, Type=HARD, Final State=1
[1327419369.798706] [016.1] [pid=12950] Post-handle_host_state() Host: testhost, Attempt=1/5, Type=HARD, Final State=1
[1327419369.798734] [016.1] [pid=12950] Checking host 'testhost' for flapping...
[1327419369.798781] [016.1] [pid=12950] ** Async check result for host 'testhost' handled: new state=1

------------------------------------------------------------------------------
Keep Your Developer Skills Current with LearnDevNow!
The most comprehensive online learning library for Microsoft developers
is just $99.99! Visual Studio, SharePoint, SQL - plus HTML5, CSS3, MVC3,
Metro Style Apps, more. Free future releases when you subscribe now!
http://p.sf.net/sfu/learndevnow-d2d
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null





More information about the Users mailing list