Unexplained nagios crashes

Duncan Ferguson duncan.ferguson at altinity.com
Wed Aug 15 17:45:22 CEST 2007


Hiya Ethan, list.

We are hoping someone may be able to help diagnose what is going on  
with an obscure problem we have.  After going cross-eyed from looking  
at this over the last few weeks I thought it best to see if anyone  
else has seen/experienced the same thing.

We have a single customer that has been suffering sporadic Nagios
daemon crashes since June. We have not been able to find anything
unique about their setup, and other customers running the exact same
binaries (and a distributed setup with the same number of slaves) on
the same OS have had no crashes in the same period.

Salient points:
* this is using a patched nagios 2.8 binary, a patched 1.4b2 ndoutils
broker module and an in-house broker module
* the crashes are intermittent and irregular, at no fixed time of  
day. Might have three crashes one day, then nothing for two days,  
then one crash a day for four days
* Studying the core dump, the code bombs out in
commands.c:process_passive_service_checks while traversing the
passive_check_result_list linked list

We have added a bit of extra code to print out the entire
passive_check_result_list structure before the fork, and from what we
can see in the core dump the list is corrupted midway through. The
last readable record has a 'next' pointing to what looks like a valid
area of memory, but nothing is there. However,
passive_check_result_list_tail has a valid entry, which implies
everything was added into the list OK in the first place.

So between being added into the linked list and being read from the
linked list, a record is removed.  The list holds well below the maximum
number of buffer slots, so lack of memory isn't the problem (else the
tail entry would also be screwed).
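
For anyone wanting to reproduce the kind of check described above, here
is a minimal sketch in C of walking a singly linked list from its head
and confirming the walk ends at the recorded tail. The names pcr_node,
list_status and check_list are hypothetical and are not taken from
commands.c; the node is a simplified stand-in for the real passive check
result structure. Note that this only detects a lost or unlinked record
or a cycle. If 'next' holds a wild pointer, the walk itself can still
fault.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified node: the real Nagios structure has more fields. */
typedef struct pcr_node {
    const char *host_name;
    const char *svc_description;
    struct pcr_node *next;
} pcr_node;

typedef enum {
    LIST_OK = 0,
    LIST_TAIL_MISMATCH, /* walk hit NULL on a node other than the tail */
    LIST_TOO_LONG       /* more than max_nodes seen: cycle or runaway chain */
} list_status;

/*
 * Walk from head, counting nodes, and compare the last node reached
 * with the separately tracked tail pointer. max_nodes bounds the walk
 * so a cycle cannot loop forever. count_out may be NULL.
 */
list_status check_list(const pcr_node *head, const pcr_node *tail,
                       size_t max_nodes, size_t *count_out)
{
    size_t count = 0;
    const pcr_node *last = NULL;

    assert(max_nodes > 0);

    for (const pcr_node *n = head; n != NULL; n = n->next) {
        if (++count > max_nodes) {
            if (count_out != NULL)
                *count_out = count;
            return LIST_TOO_LONG;
        }
        last = n;
    }

    if (count_out != NULL)
        *count_out = count;

    /* An empty list is consistent only if the tail is NULL as well. */
    return (last == tail) ? LIST_OK : LIST_TAIL_MISMATCH;
}
```

Calling something like this both when a record is appended and just
before the list is processed would show whether the node count drops
between the two points, which is the symptom described above.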

We have been unable to find any code that would cause this behavior,
which is all the more puzzling given that the list is confined to
commands.c, and that this section is called and used very often while
the crashes are few and far between in comparison.

The nagios binary has been compiled with "-ggdb -O0" for debugging
purposes and is running on Debian Etch i386 with 4x Intel Xeon 1.86GHz
CPUs and 4GB of memory.  The core dump, nagios binary and commands.c
are available at http://resources.opsview.org/nagios_crash.tar.gz

Any insight or help would be appreciated.

   Duncs
