Out of Order Passive Check Results

brian.boysen at colinx.com brian.boysen at colinx.com
Mon Jun 14 20:50:21 CEST 2004


I have a problem with Passive Checks timing out.

>From nagios.log, the services appear to have result commands still coming 
into the system. The difference in timestamps from the last message to the 
timeout message is way too low for the service to timeout. I thought that 
the commands were taking too long to get through the IPC and get 
registered, but that isn't the case (or at least when I send through a 
message with an invalid host and service the warning message only takes a 
couple of seconds to appear in the log).
By looking at the source code, the first thing I found out is that the 
timestamp in the nagios.log
[1087236197] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;<hostname>;<service name>;0;<message>
does not correspond to the time used for the freshness check. I prepended 
the timestamp for the result to the message to see those times and print 
the timestamp at the check. I also added print statement of the timestamps 
in the write_svc_msg and read_svc_msg to track their progress through the 
system.
I've got one instance where a message with a much earlier timestamp 
appears after several others. There's the section of nagios.log

[1087217984] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;host;svc;0;1087217971Keep Alive
[1087218044] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;host;svc;0;1087218034Keep Alive
[1087218105] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;host;svc;0;1087218098Keep Alive
[1087218164] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;host;svc;0;1087218161Keep Alive
[1087218282] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;host;vsc;0;1087218224Keep Alive
[1087218343] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;host;svc;0;1087218287Keep Alive
[1087218403] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;host;svc;0;1087218350Keep Alive
[1087218464] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;host;svc;0;1087218413Keep Alive
[1087218534] EXTERNAL COMMAND: 
PROCESS_SERVICE_CHECK_RESULT;host;svc;0;1087218476Keep Alive
[1087218551] Warning: The results of service 'svc' on host 'host' are 
stale by 80 seconds (threshold=500 seconds).  I'm forcing an immediate 
check of the service.

I did a grep -n of the stdout for each of the timestamps, and here's the 
result.

9872:write 1087217971
9896:read 1087217971
10478:checking 1087217971 at 1087218551
8379:write 1087218034
8405:read 1087218034
8549:checking 1087218034 at 1087218251
8720:checking 1087218034 at 1087218311
9164:write 1087218098
9165:write 1087218098
9166:write 1087218098
9167:write 1087218098
9173:read 1087218098
9174:read 1087218098
9176:read 1087218098
9177:read 1087218098
9197:checking 1087218098 at 1087218491
9202:checking 1087218098 at 1087218491
9220:checking 1087218098 at 1087218491
8214:write 1087218161
8238:read 1087218161
8260:checking 1087218161 at 1087218191
8484:write 1087218161
8555:read 1087218161
8789:write 1087218161
8798:read 1087218161
8885:checking 1087218161 at 1087218371
8645:write 1087218224
8681:write 1087218224
8726:read 1087218224
8727:read 1087218224
8852:write 1087218287
8853:write 1087218287
8901:read 1087218287
8930:read 1087218287
9051:checking 1087218287 at 1087218431
9269:write 1087218413
9281:read 1087218413
10500:write 1087218476
10515:read 1087218476

You can see that 7971 (the cause of the problem) get written at line 9872 
and read at line 9896 well after everyone, but 1087218476. That actually 
gets written after the timeout occurs. This is an issue right. It's always 
possible I'm delusional, I've been trying to figure out this problem for 3 
days straight.

Has anyone else seen something like this?
(I'm currently working on getting this to consistently reproduce, but it's 
bear.)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.monitoring-lists.org/archive/users/attachments/20040614/9f978160/attachment.html>


More information about the Users mailing list