Unexplained nagios crashes

Steffen Poulsen step at tdc.dk
Tue Aug 21 15:54:31 CEST 2007


 

> -----Original message-----
> From: nagios-devel-bounces at lists.sourceforge.net
> [mailto:nagios-devel-bounces at lists.sourceforge.net] On behalf
> of Andreas Ericsson
> Sent: 21 August 2007 10:45
> To: Nagios Developers List
> Subject: Re: [Nagios-devel] Unexplained nagios crashes
> 
> What thread library is the customer using (make, model, version,
> everything...)?
> What's the uname -a output?
> If Linux, which scheduler is being used in the kernel?
> 
> 
> 
> Duncan Ferguson wrote:
> > Hiya Ethan, list.
> > 
> > We are hoping someone may be able to help diagnose what is going on
> > with an obscure problem we have.  After going cross-eyed from looking
> > at this over the last few weeks I thought it best to see if anyone
> > else has seen/experienced the same thing.
> > 
> > We have a single customer that has been suffering sporadic nagios
> > daemon crashes since June - nothing is unique about their setup that
> > we have been able to find, and other customers have the exact same
> > binaries (and distributed setup with same number of slaves) on the
> > same OS and have had no crashes in the same period of time.
> > 
> > Salient points:
> > * this is using a patched nagios 2.8 binary, a patched 1.4b2 ndoutils
> >   broker module and an in-house broker module
> > * the crashes are intermittent and irregular, at no fixed time of day.
> >   Might have three crashes one day, then nothing for two days, then one
> >   crash a day for four days
> > * Studying the core dump, the code bombs out in
> >   commands.c:process_passive_service_checks while traversing the
> >   passive_check_result_list linked list
> > 
> > We have added a bit of extra code to print out the entire
> > passive_check_result_list structure before the fork. From what we
> > can see in the core dump, the list is corrupted midway through: the
> > last readable record has a 'next' pointing to what looks like a valid
> > area of memory, but nothing is there. However,
> > passive_check_result_list_tail has a valid entry, which implies
> > everything was added to the list OK in the first place.
> > 
> > So between being added to the linked list and being read from the
> > linked list, a record is removed.  The list has well below the maximum
> > number of buffer slots, so lack of memory isn't the problem (else the
> > tail entry would also be screwed).
> > 
> > We have been unable to find any code that would cause this behavior
> > (the list is confined to commands.c), especially given that this
> > section is called and used as often as it is and the crashes are few
> > and far between in comparison.
> > 
> > The nagios binary has been compiled with "-ggdb -O0" for debugging
> > purposes and is running on Debian Etch i386 with 4x Intel Xeon 1.86GHz
> > CPUs and 4GB of memory.  The core dump, nagios binary and commands.c
> > are available at http://resources.opsview.org/nagios_crash.tar.gz
> > 
> > Any insight or help would be appreciated.
> > 
> >    Duncs
> > 
> > 
> > -------------------------------------------------------------------------
> > This SF.net email is sponsored by: Splunk Inc.
> > Still grepping through log files to find problems?  Stop.
> > Now Search log events and configuration files using AJAX and a browser.
> > Download your FREE copy of Splunk now >>  http://get.splunk.com/
> > _______________________________________________
> > Nagios-devel mailing list
> > Nagios-devel at lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/nagios-devel
> 
> 
> -- 
> Andreas Ericsson                   andreas.ericsson at op5.se
> OP5 AB                             www.op5.se
> Tel: +46 8-230225                  Fax: +46 8-230231
> 
> 
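
One way to narrow down where the list goes bad would be to run an integrity check on it at a few points between insertion and processing. The following is only a minimal sketch, not code from the Nagios source: pcr_node, list_status, check_list and max_nodes are hypothetical names, and the node type is a stand-in that assumes the real passive check result struct is singly linked through a 'next' pointer, as the description above suggests. It walks from the head and reports whether the walk ends at the recorded tail, with a cap on the walk so a cycle cannot loop forever.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the passive check result node;
 * only the 'next' link matters for this check. */
typedef struct pcr_node {
    struct pcr_node *next;
} pcr_node;

typedef enum {
    LIST_OK,               /* walk from head ends exactly at tail      */
    LIST_TAIL_UNREACHABLE, /* walk ends on a node other than tail      */
    LIST_TOO_LONG          /* more than max_nodes seen, possible cycle */
} list_status;

/* Walk the list from head and confirm the last node reached is tail.
 * max_nodes bounds the walk (assumed: the configured buffer slot limit). */
static list_status check_list(const pcr_node *head, const pcr_node *tail,
                              size_t max_nodes)
{
    const pcr_node *cur = head;
    const pcr_node *last = NULL;
    size_t seen = 0;

    while (cur != NULL) {
        if (++seen > max_nodes)
            return LIST_TOO_LONG;
        last = cur;
        cur = cur->next;
    }

    /* An empty list is consistent only if tail is NULL as well. */
    return (last == tail) ? LIST_OK : LIST_TAIL_UNREACHABLE;
}
```

This cannot catch a 'next' that points at freed or unmapped memory (the walk itself would fault there), but calling it after each insertion and logging the first non-OK result should show whether the chain is already broken before process_passive_service_checks starts reading it.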




