Some hard state changes missing in NDOUtils

Ton Voon ton.voon at altinity.com
Tue Nov 13 14:24:26 CET 2007


Hi!

We've been doing some work to validate the data in NDOUtils and found  
a bug in Nagios and a missing state change entry. This happens when a  
service is in a failed state and changes to a different state at the  
same time that the host is considered down (or unreachable).

DETAIL

These are the servicecheck results in the database:

mysql> select  
start_time,state,state_type,output,current_check_attempt,max_check_attem 
pts from nagios_servicechecks where service_object_id=445 and  
start_time between '2007-11-05 13:40:00' and '2007-11-05 14:00:00';
+---------------------+-------+------------ 
+----------------------------------------------------- 
+-----------------------+--------------------+
| start_time          | state | state_type |  
output                                              |  
current_check_attempt | max_check_attempts |
+---------------------+-------+------------ 
+----------------------------------------------------- 
+-----------------------+--------------------+
| 2007-11-05 13:41:18 |     1 |          1 | DISK WARNING - free  
space: / 1938 MB (10% inode=-): |                     3  
|                  3 |
| 2007-11-05 13:46:18 |     1 |          1 | DISK WARNING - free  
space: / 1939 MB (10% inode=-): |                     3  
|                  3 |
| 2007-11-05 13:51:18 |     2 |          1 | CHECK_NRPE: Socket  
timeout after 10 seconds.        |                     1  
|                  3 |
| 2007-11-05 13:56:18 |     1 |          0 | DISK WARNING - free  
space: / 1939 MB (10% inode=-): |                     1  
|                  3 |
| 2007-11-05 13:57:18 |     1 |          0 | DISK WARNING - free  
space: / 1939 MB (10% inode=-): |                     2  
|                  3 |
| 2007-11-05 13:58:39 |     0 |          1 | DISK OK - free space: /  
2639 MB (14% inode=-):      |                     1  
|                  3 |
+---------------------+-------+------------ 
+----------------------------------------------------- 
+-----------------------+--------------------+

6 rows in set (0.02 sec)


Note that the current_check_attempt is 1/3 for the CRITICAL event at  
13:51:18. This should be 3/3. A side effect of this is that the  
subsequent warning at 13:56:18 is now considered a soft state when it  
should remain as hard.


Looking at the state history table, we get:

mysql> select  
state_time,state,state_type,output,current_check_attempt,max_check_attem 
pts from nagios_statehistory where object_id=445 and state_time  
between '2007-11-05 11:50:00' and '2007-11-05 14:00:00';
+---------------------+-------+------------ 
+----------------------------------------------------- 
+-----------------------+--------------------+
| state_time          | state | state_type |  
output                                              |  
current_check_attempt | max_check_attempts |
+---------------------+-------+------------ 
+----------------------------------------------------- 
+-----------------------+--------------------+
| 2007-11-05 11:51:05 |     1 |          1 | DISK WARNING - free  
space: / 1902 MB (10% inode=-): |                     3  
|                  3 |
| 2007-11-05 13:56:39 |     1 |          0 | DISK WARNING - free  
space: / 1939 MB (10% inode=-): |                     1  
|                  3 |
| 2007-11-05 13:57:19 |     1 |          0 | DISK WARNING - free  
space: / 1939 MB (10% inode=-): |                     2  
|                  3 |
| 2007-11-05 13:58:41 |     0 |          1 | DISK OK - free space: /  
2639 MB (14% inode=-):      |                     3  
|                  3 |
+---------------------+-------+------------ 
+----------------------------------------------------- 
+-----------------------+--------------------+
4 rows in set (0.00 sec)


Note that the state change from warn to critical at 13:51:18 has been  
missed from here.

These are the relevant lines from nagios.log (the first just to show  
that there were no interesting entries before 13:52:07):

Mon Nov  5 13:50:57 2007 SERVICE ALERT: unrelatedhost;TCP/ 
IP;CRITICAL;HARD;1;PING CRITICAL - Packet loss = 100%
Mon Nov  5 13:52:07 2007 HOST ALERT: hostname;DOWN;SOFT;1;PING  
CRITICAL - Packet loss = 100%
Mon Nov  5 13:52:17 2007 HOST ALERT: hostname;DOWN;SOFT;2;PING  
CRITICAL - Packet loss = 100%
Mon Nov  5 13:52:37 2007 HOST ALERT: hostname;UNREACHABLE;HARD;3;PING  
CRITICAL - Packet loss = 100%
Mon Nov  5 13:52:37 2007 SERVICE ALERT: hostname;/home;CRITICAL;HARD; 
1;CHECK_NRPE: Socket timeout after 10 seconds.
Mon Nov  5 13:52:47 2007 SERVICE ALERT: hostname;/;CRITICAL;HARD; 
1;CHECK_NRPE: Socket timeout after 10 seconds.
Mon Nov  5 13:53:56 2007 HOST ALERT: hostname;UP;HARD;1;PING OK -  
Packet loss = 0%, RTA = 33.70 ms
Mon Nov  5 13:56:39 2007 SERVICE ALERT: hostname;/home;WARNING;SOFT; 
1;DISK WARNING - free space: / 1939 MB (10% inode=-):
Mon Nov  5 13:56:39 2007 SERVICE ALERT: hostname;/;WARNING;SOFT; 
1;DISK WARNING - free space: / 1939 MB (10% inode=-):

The log entries show that the service failure is due to a host down  
state.


SOLUTION

Attached is a patch that works for us by changing a few lines in  
checks.c.

We managed to recreate this manually on a pre-patched system and saw  
the same behaviour. After the patch was applied, the state change  
from warn to critical was correctly added into the state history  
table, and the subsequent return to warn was also added as a hard state.

This is on Nagios 2.9 (with Altinity patches) & NDOUtils 1.4b3.

Does this seem valid?

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nagios_host_failures_cause_incorrect_service_states.patch
Type: application/octet-stream
Size: 1930 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20071113/ffac26c7/attachment.obj>
-------------- next part --------------

-------------- next part --------------
-------------------------------------------------------------------------
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now >> http://get.splunk.com/
-------------- next part --------------
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel


More information about the Developers mailing list