Nagios FIFO bug - out of order service results

Sergio Freire sergio-s-freire at ptinovacao.pt
Sat Mar 12 21:03:14 CET 2005


I posted this on the Nagios Devel mailing list, but I don't know if it is
going to be accepted. I also contacted you a long time ago about this bug,
although back then I did not have a clear idea of what was happening. I
hope, by the way, that everything is OK with you. Nice slides from FOSDEM;
a pity I missed it, but my company doesn't spend much money on this kind
of thing.


Hi.
I have been using Nagios for a long time, and there is a bug that has
persisted across many versions.
I have services which are passive only; basically they reflect the state
of some script that runs periodically. When the script starts it puts the
service in an UNKNOWN state, and when it finishes it puts it in OK.
Sometimes the OK state was not processed. So I went into the source code
and added some debug statements that log to syslog, and I found that even
though the external commands are sent in the correct order (first the
UNKNOWN state and then later the OK state), Nagios sometimes reads the
service check results from the FIFO in the wrong order, not in the order
expected from a FIFO.
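
For readers who have not used passive checks: a result is submitted by
writing one PROCESS_SERVICE_CHECK_RESULT line to the Nagios external
command file. Below is a minimal Python sketch of what a script like the
one described above might send at start and at finish. The function names
and the command file path are assumptions made for illustration (the path
depends on the installation), and the timestamps are example values.

```python
import time

# Assumed location of the external command file; adjust to the installation.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"

UNKNOWN, OK = 3, 0  # plugin return codes used by the script


def format_passive_result(host, service, return_code, output, now=None):
    """Build one external command line for a passive service check result."""
    timestamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


def submit(line, command_file=COMMAND_FILE):
    """Write one command line to the command file (a named pipe)."""
    with open(command_file, "w") as fifo:
        fifo.write(line)


# What the script would send when it starts and when it finishes.
start_cmd = format_passive_result("bill2", "BOASV", UNKNOWN, "BOASV STARTED",
                                  now=1110651980)
stop_cmd = format_passive_result("bill2", "BOASV", OK, "BOASV STOP",
                                 now=1110651985)
```

Calling submit(start_cmd) and, a few seconds later, submit(stop_cmd) would
send the two commands in the order in which they should be processed.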

What I did was write "###write_svc" to syslog every time
write_svc_message is called (which happens when a passive check result
is submitted) and write "###read_svc" to syslog when Nagios reads the
service check result from the FIFO.
I also write "###reap_checkresults" when Nagios processes a check result
(which should happen after a read_svc_message from the FIFO).

What I was expecting to see was Nagios processing the results as in a
FIFO (the oldest result is processed first) and not as in a LIFO (the
most recent service result is processed first).
FIFO order is what I see most of the time, but not always!

This is clear to see in the few log lines below:


Mar 12 19:26:20 bill2 nagios: EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;bill2;BOASV;3;BOASV STARTED
Mar 12 19:26:20 bill2 nagios: ###write_svc: service 'BOASV' on host
'bill2' | ret=3 | out=BOASV STARTED .
Mar 12 19:26:25 bill2 nagios: EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;bill2;BOASV;0;BOASV STOP
Mar 12 19:26:25 bill2 nagios: ###write_svc: service 'BOASV' on host
'bill2' | ret=0 | out=BOASV STOP .

Mar 12 19:26:35 bill2 nagios: ###read_svc: service 'BOASV' on host
'bill2' | ret=0 | out=BOASV STOP .
Mar 12 19:26:35 bill2 nagios: ###reap_checkresults: service 'BOASV' on
host 'bill2' | ret=0 | out=BOASV STOP .

Mar 12 19:26:35 bill2 nagios: ###read_svc: service 'BOASV' on host
'bill2' | ret=3 | out=BOASV STARTED .
Mar 12 19:26:35 bill2 nagios: ###reap_checkresults: service 'BOASV' on
host 'bill2' | ret=3 | out=BOASV STARTED .
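
A comparison like the one done by eye above can also be automated. The
following Python sketch assumes the "###write_svc" and "###read_svc"
debug markers in exactly the format shown in the log lines, each on a
single unwrapped syslog line. The regular expression and function names
are illustrative only and are not part of Nagios.

```python
import re

# Matches the custom debug markers on unwrapped syslog lines.
MARKER = re.compile(
    r"###(write_svc|read_svc): service '([^']*)' on\s+host '([^']*)' "
    r"\| ret=(\d+) \| out=(.*?) \.$"
)


def results_by_stage(lines):
    """Collect (host, service, ret, output) tuples per stage, in log order."""
    stages = {"write_svc": [], "read_svc": []}
    for line in lines:
        match = MARKER.search(line.rstrip())
        if match:
            stage, service, host, ret, output = match.groups()
            stages[stage].append((host, service, int(ret), output))
    return stages


def is_fifo_order(stages):
    """True when results were read in the same order they were written."""
    return stages["read_svc"] == stages["write_svc"]


SAMPLE = [
    "Mar 12 19:26:20 bill2 nagios: ###write_svc: service 'BOASV' on host 'bill2' | ret=3 | out=BOASV STARTED .",
    "Mar 12 19:26:25 bill2 nagios: ###write_svc: service 'BOASV' on host 'bill2' | ret=0 | out=BOASV STOP .",
    "Mar 12 19:26:35 bill2 nagios: ###read_svc: service 'BOASV' on host 'bill2' | ret=0 | out=BOASV STOP .",
    "Mar 12 19:26:35 bill2 nagios: ###read_svc: service 'BOASV' on host 'bill2' | ret=3 | out=BOASV STARTED .",
]

stages = results_by_stage(SAMPLE)
print("FIFO order preserved:", is_fifo_order(stages))
```

Run against a longer syslog extract, a check of this kind could show how
often the read order differs from the write order.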


This happened on RedHat 7.3 (kernel 2.4.20-20.7smp) with Nagios 1.2, but
I have also seen it happen on RHAS 3.0. We have many different platforms
with Nagios deployed, but they are all RedHat based.
I have seen this problem since Nagios 1.0, and I also tried the latest
CVS 1.x branch and I still see it happening.


Anyone, Ethan, any ideas about what is happening?
There is an obvious problem with the FIFO which holds the service check
results, but is it due to some OS bug or to some incorrect
use/initialization of the FIFO in the source code?
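
As a baseline for the OS side of that question, here is a small Python
sketch of the behaviour a pipe is expected to give: with a single writer,
data comes out in the order it was written. It uses an anonymous pipe from
os.pipe() and a hypothetical helper name. It does not reproduce how Nagios
sets up or reads its own FIFO, so it says nothing about where the
reordering actually occurs.

```python
import os


def roundtrip_through_pipe(messages):
    """Write short messages to a pipe one by one, then read them all back."""
    read_fd, write_fd = os.pipe()
    try:
        for message in messages:
            # Each message is far smaller than the pipe buffer, so the
            # writes complete without blocking.
            os.write(write_fd, (message + "\n").encode("ascii"))
    finally:
        os.close(write_fd)  # closing the write end lets the reader see EOF
    with os.fdopen(read_fd, "r", encoding="ascii") as reader:
        return reader.read().splitlines()


sent = ["ret=3 BOASV STARTED", "ret=0 BOASV STOP"]
received = roundtrip_through_pipe(sent)
print("order preserved:", received == sent)
```

If a test like this preserves order on the affected machines, that would
point the investigation towards how the results are queued or read in the
application rather than towards the kernel.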


Regards,
Sergio Freire


-- 
Sergio Freire
Serviços e Redes Móveis 
PT Inovação, SA
Rua Eng. José Ferreira Pinto Basto
3810 - 106 Aveiro - Portugal
Tel.   +351 234403609
WwW    http://www.ptinovacao.pt
Jabber sergio at im.ptinovacao.pt
Blog   http://blog.globalpt.net/nelito