Passive host down result is interpreted as up on master

Ethan Galstad nagios at nagios.org
Mon Mar 19 20:28:44 CET 2007


Ton Voon wrote:
> Hi!
> 
> On 16 Mar 2007, at 18:02, Ton Voon wrote:
> 
>> I was wondering if anyone has seen this before. On a slave, we have a 
>> host that is marked as DOWN with a plugin output of "CRITICAL - Plugin 
>> timed out after 10 seconds", as expected. However, on the master, that 
>> host is marked as UP with the same text.
>>
>>
>> The logs on the master server show:
>>
>> [1174045717] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;host1;0;PING 
>> OK - Packet loss = 0%, RTA = 0.37 ms|
>>
>> Host is marked as UP. Later on:
>>
>> [1174045949] EXTERNAL COMMAND: 
>> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after 10 
>> seconds|
>>
>> Failure arrives.
>>
>> [1174045949] HOST ALERT: host1;DOWN;HARD;1;CRITICAL - Plugin timed out 
>> after 10 seconds
>>
>> Marked it as DOWN with alert. As expected.
>>
>> [1174045951] Warning: The results of service '/ - partition' on host 
>> 'host1' are stale by 24 seconds (threshold=82 seconds).  I'm forcing 
>> an immediate check of the service.
>> [1174045953] SERVICE ALERT: host1;/ - 
>> partition;UNKNOWN;HARD;1;UNKNOWN: Service results are stale
>> [1174045959] EXTERNAL COMMAND: 
>> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after 10 
>> seconds|
>>
>> More passive results
>>
>> [1174045971] EXTERNAL COMMAND: 
>> PROCESS_HOST_CHECK_RESULT;host1;1;CRITICAL - Plugin timed out after 10 
>> seconds|
>>
>> And again, but this time...
>>
>> [1174045973] HOST ALERT: host1;UP;HARD;1;CRITICAL - Plugin timed out 
>> after 10 seconds
>>
>> Nagios has marked the host as UP, even though the 
>> PROCESS_HOST_CHECK_RESULT reports a DOWN state (return code 1).
>>
>>
>> The complete nagios.log around this period is attached. I'm at a loss to 
>> understand why this has happened. Has anyone got any clues, or seen 
>> something similar?
>>
>> We haven't been able to reproduce this consistently yet.
>>
>> This is on Nagios 2.5 (with some local patches).
> 
> 
> We think we've found the root problem.
> 
> In checks.c, if a host does not have a check_command, there is a debug 
> line that says: "No host check command specified, so no check will be 
> done (host state assumed to be unchanged)". However, it then returns 
> HOST_UP. We have amended this to return hst->current_state instead.
> 
> In our distributed setup, we define a host without a check_command, 
> instead relying on the passive host results sent by the slave. However, 
> on the master, if a service on this host passes its freshness threshold, 
> a host check is scheduled, with the force flag. This then gets to this 
> portion of the code and returns a HOST_UP state rather than the current 
> state, thus showing an incorrect state for the host.
> 
> Our patch is below, made against Nagios 2.8.
> 
> Ton
> 

Good catch!  I'll get this into CVS pronto.
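
For anyone reading the archive without the patch to hand, the change Ton 
describes can be sketched roughly as below. This is not the actual patch 
or the real checks.c code: the host_sketch struct and the functions 
check_host_old(), check_host_patched() and run_plugin_sketch() are 
hypothetical stand-ins. Only the HOST_UP/HOST_DOWN/HOST_UNREACHABLE state 
values and the idea of returning hst->current_state come from the thread 
and from Nagios itself.

```c
#include <stddef.h>

/* Nagios host state codes. */
#define HOST_UP          0
#define HOST_DOWN        1
#define HOST_UNREACHABLE 2

/* Hypothetical, cut-down stand-in for the Nagios host structure. */
typedef struct {
    const char *name;
    const char *check_command;  /* NULL when no check_command is defined */
    int current_state;          /* last known state, e.g. from a passive result */
} host_sketch;

/* Hypothetical placeholder for actually executing the check plugin. */
static int run_plugin_sketch(const host_sketch *hst)
{
    (void)hst;
    return HOST_UP;
}

/* Behaviour described as the bug: no check command means "assume UP". */
int check_host_old(const host_sketch *hst)
{
    if (hst->check_command == NULL)
        return HOST_UP;              /* overwrites a passive DOWN result */
    return run_plugin_sketch(hst);
}

/* Behaviour described as the fix: no check command means "state unchanged". */
int check_host_patched(const host_sketch *hst)
{
    if (hst->check_command == NULL)
        return hst->current_state;   /* keep whatever the slave last reported */
    return run_plugin_sketch(hst);
}
```

With a host that has no check_command and a current state of HOST_DOWN, 
the old path reports HOST_UP when a forced check is triggered (for 
example by a stale service result), while the patched path keeps 
HOST_DOWN.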


Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org
