Unexplained nagios crashes

Andreas Ericsson ae at op5.se
Tue Aug 21 10:45:11 CEST 2007


What thread-library is the customer using (make, model, version, everything...)?
What's the uname -a output?
If Linux, which scheduler is being used in the kernel?



Duncan Ferguson wrote:
> Hiya Ethan, list.
> 
> We are hoping someone may be able to help diagnose what is going on  
> with an obscure problem we have.  After going cross-eyed from looking  
> at this over the last few weeks I thought it best to see if anyone  
> else has seen/experienced the same thing.
> 
> We have a single customer that has been suffering sporadic nagios  
> daemon crashes since June - nothing is unique about their setup that
> we have been able to find and other customers have the exact same  
> binaries (and distributed setup with same number of slaves) on the  
> same OS and have had no crashes in the same period of time.
> 
> Salient points:
> * this is using a patched nagios 2.8 binary, a patched 1.4b2 ndoutils  
> broker module and an in-house broker module
> * the crashes are intermittent and irregular, at no fixed time of  
> day. Might have three crashes one day, then nothing for two days,  
> then one crash a day for four days
> * Studying the core dump, the code bombs out in  
> commands.c:process_passive_service_checks while traversing the
> passive_check_result_list linked list
> 
> We have added in a bit of extra code to print out the entire  
> passive_check_result_list structure before the fork, and from what we  
> can see in the core dump the list is corrupted midway through: the
> last readable record has a 'next' pointer to what looks like a valid
> area of memory, yet no record is there. However,
> passive_check_result_list_tail has a valid entry, which implies
> everything was added to the list correctly in the first place.
> 
> So between being added to the linked list and being read from the
> linked list, a record appears to be removed.  The list has well below
> the maximum number of buffer slots, so lack of memory isn't the
> problem (else the tail entry would also be screwed).
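
The head/tail reasoning above can be turned into a small consistency
check. What follows is only a sketch under stated assumptions, not code
from Nagios: struct pcr_node, enum list_state and check_list() are
hypothetical names, and the node is a cut-down stand-in for the real
passive check result structure. It walks the list from the head and
reports whether the walk ends on the tail, ends early, or never ends.

```c
#include <stddef.h>

/* Hypothetical, simplified node. The real structure in commands.c
 * carries more fields; only the 'next' link matters for this check. */
struct pcr_node {
    const char *host_name;
    struct pcr_node *next;
};

enum list_state {
    LIST_OK = 0,            /* walk from head ended exactly on tail */
    LIST_TAIL_UNREACHABLE,  /* walk ended on some node other than tail */
    LIST_TOO_LONG           /* more than max_nodes links: likely a cycle */
};

/* Walk from head, following at most max_nodes links, and store the
 * number of nodes visited in *count. An empty list (head and tail both
 * NULL) is reported as LIST_OK with a count of 0. */
enum list_state check_list(const struct pcr_node *head,
                           const struct pcr_node *tail,
                           size_t max_nodes, size_t *count)
{
    size_t n = 0;
    const struct pcr_node *last = NULL;

    for (const struct pcr_node *p = head; p != NULL; p = p->next) {
        if (n == max_nodes) {
            *count = n;
            return LIST_TOO_LONG;
        }
        n++;
        last = p;
    }

    *count = n;
    return (last == tail) ? LIST_OK : LIST_TAIL_UNREACHABLE;
}
```

Calling something like this right before the fork, and again right
before the list is processed, would narrow down when the list goes bad.
It only catches a chain that stops short of the tail or loops back on
itself. A 'next' pointer into unmapped or reused memory will still
fault during the walk, although the count at that point shows how far
the list was intact.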
> 
> We have been unable to find any code that would cause this behavior,
> particularly as the list is confined to commands.c and this section
> is called very often while the crashes are few and far between
> (in comparison).
> 
> The nagios binary has been compiled with "-ggdb -O0" for debugging
> purposes and is running on Debian Etch i386 with 4x Intel Xeon 1.86GHz
> CPUs and 4GB of memory.  The core dump, nagios binary and commands.c
> are available at http://resources.opsview.org/nagios_crash.tar.gz
> 
> Any insight or help would be appreciated.
> 
>    Duncs
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >>  http://get.splunk.com/
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel


-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
