Passive host down result is interpreted as up on master

Ton Voon ton.voon at altinity.com
Mon Mar 19 16:44:35 CET 2007


Hi!

On 16 Mar 2007, at 18:02, Ton Voon wrote:

> I was wondering if anyone has seen this before. On a slave, we have  
> a host that is marked as DOWN with a plugin output of "CRITICAL -  
> Plugin timed out after 10 seconds", as expected. However, on the  
> master, that host is marked as UP with the same text.
>
>
> The logs on the master server show:
>
> [1174045717] EXTERNAL COMMAND:  
> PROCESS_HOST_CHECK_RESULT;host1;0;PING OK - Packet loss = 0%, RTA =  
> 0.37 ms|
>
> Host is marked as UP. Later on:
>
> [1174045949] EXTERNAL COMMAND:  
> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after  
> 10 seconds|
>
> Failure arrives.
>
> [1174045949] HOST ALERT: host1;DOWN;HARD;1;CRITICAL - Plugin timed  
> out after 10 seconds
>
> Marked it as DOWN with alert. As expected.
>
> [1174045951] Warning: The results of service '/ - partition' on  
> host 'host1' are stale by 24 seconds (threshold=82 seconds).  I'm  
> forcing an immediate check of the service.
> [1174045953] SERVICE ALERT: host1;/ - partition;UNKNOWN;HARD; 
> 1;UNKNOWN: Service results are stale
> [1174045959] EXTERNAL COMMAND:  
> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after  
> 10 seconds|
>
> More passive results
>
> [1174045971] EXTERNAL COMMAND:  
> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after  
> 10 seconds|
>
> And again, but this time...
>
> [1174045973] HOST ALERT: host1;UP;HARD;1;CRITICAL - Plugin timed  
> out after 10 seconds
>
> Nagios has marked the host as UP, even though the  
> PROCESS_HOST_CHECK_RESULT reported state 1 (DOWN).
>
>
> The complete nagios.log around this period is attached. I'm at a  
> loss to understand why this has happened. Has anyone got any clues,  
> or seen something similar?
>
> We haven't been able to reproduce this consistently yet.
>
> This is on Nagios 2.5 (with some local patches).


We think we've found the root problem.

In checks.c, if a host does not have a check_command, there is a  
debug line that says: "No host check command specified, so no check  
will be done (host state assumed to be unchanged)". However, it then  
returns HOST_UP. We have amended this to return hst->current_state  
instead.

In our distributed setup, we define hosts on the master without a  
check_command and rely on the passive host results sent by the slave.  
However, if a service on such a host exceeds its freshness threshold  
on the master, a host check is scheduled with the force flag set.  
That check reaches this portion of the code and returns HOST_UP  
rather than the current state, so the host is shown in the wrong  
state.
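
To make the failure mode concrete, here is a minimal standalone sketch in C. It is not the actual Nagios source: the host struct is cut down to three fields, and run_host_check(), its "patched" flag and execute_check_command() are hypothetical names used only for illustration. Only HOST_UP, hst->current_state and the debug message come from checks.c as described above.

```c
#include <assert.h>
#include <stddef.h>

/* Host states, numbered as in PROCESS_HOST_CHECK_RESULT (0 = UP, 1 = DOWN). */
#define HOST_UP   0
#define HOST_DOWN 1

/* Cut-down stand-in for the Nagios host structure (assumption). */
typedef struct {
    const char *name;
    const char *check_command;  /* NULL: host relies on passive results only */
    int current_state;          /* last state set by a passive result */
} host;

/* Stub for running a real active check; not relevant to this bug. */
int execute_check_command(const host *hst) {
    (void)hst;
    return HOST_UP;
}

/*
 * Sketch of the branch discussed above. With patched == 0 it models the
 * old behaviour (always HOST_UP when there is no check command); with
 * patched != 0 it models our change (keep the current state).
 */
int run_host_check(const host *hst, int patched) {
    assert(hst != NULL);

    if (hst->check_command == NULL) {
        /* "No host check command specified, so no check will be done
           (host state assumed to be unchanged)" */
        return patched ? hst->current_state : HOST_UP;
    }

    return execute_check_command(hst);
}
```

With a host that has no check_command and whose current_state was set to HOST_DOWN by a passive result, the unpatched path returns HOST_UP when a forced check runs, which matches the bogus "UP" alert in the log. The patched path returns HOST_DOWN, so the state is left alone.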

Our patch, made against Nagios 2.8, is attached below.

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nagios_no_host_check_command_returns_current_state.patch
Type: application/octet-stream
Size: 599 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20070319/7b113f9d/attachment.obj>
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel
