coredumps in wobbly networks

Ethan Galstad nagios at nagios.org
Fri Mar 25 00:02:12 CET 2005


Not sure where this is actually happening.  It looks like malloc() is 
to blame - not sure why.  The only malloc() in the 
service_result_worker_thread() routine occurs at line 4736 in 
base/utils.c, which looks ok to me.  

Anyone else have any ideas as to what might be happening?



On 24 Mar 2005 at 12:32, Andreas Ericsson wrote:

> Ahoy.
> 
> I've observed a series of most unfortunate SIGSEGVs in Nagios.
> It appears to happen when service checks pop back to OK on the second
> attempt and then something happens (see logs below).
> 
> Here are the log entries leading up to two separate crashes. They are
> taken from two separate Nagios instances on separate machines and, as
> the timestamps show, the crashes occurred at different times (the
> naglog program used to get human-readable time is available at
> http://oss.op5.se/nagios/naglog.c)
> 
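For readers without naglog at hand, the conversion it is used for above can be sketched roughly as follows. This is not the code from naglog.c; format_log_line() is a hypothetical helper that assumes raw Nagios log lines of the form "[<epoch>] <message>" and prints UTC rather than local time, so its output will be offset from the CET timestamps quoted in this mail.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper, not taken from naglog.c: rewrites one log line of
 * the form "[<epoch>] <message>" as "YYYY-MM-DD HH:MM:SS: <message>".
 * Returns 0 on success, -1 if the line has no leading timestamp or the
 * result does not fit in out. */
int format_log_line(const char *line, char *out, size_t outlen)
{
    char *end;
    long epoch;
    time_t t;
    struct tm *tm;
    char stamp[32];
    int n;

    if (line[0] != '[')
        return -1;

    epoch = strtol(line + 1, &end, 10);
    if (end == line + 1 || *end != ']')
        return -1;

    end++;                      /* skip the closing ']' */
    while (*end == ' ')         /* skip the separator after it */
        end++;

    t = (time_t)epoch;
    tm = gmtime(&t);            /* assumption: UTC; localtime() gives local time */
    if (tm == NULL)
        return -1;
    if (strftime(stamp, sizeof(stamp), "%Y-%m-%d %H:%M:%S", tm) == 0)
        return -1;

    n = snprintf(out, outlen, "%s: %s", stamp, end);
    if (n < 0 || (size_t)n >= outlen)
        return -1;

    out[strcspn(out, "\n")] = '\0';   /* drop a trailing newline, if any */
    return 0;
}
```

Feeding each line of the log file through this function (for example from an fgets() loop) and printing the result gives output in the same shape as the entries quoted below.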
> [ crash 1, on primary server ]
> 2005-03-20 22:11:57: Auto-save of retention data completed successfully.
> 2005-03-20 22:25:56: SERVICE ALERT: foo-host;PING;WARNING;SOFT;1;WARNING - x.x.x.x: rta 107 ms, lost 0%
> 2005-03-20 22:26:56: SERVICE ALERT: foo-host;PING;OK;SOFT;2;OK - x.x.x.x: rta 1.82 ms, lost 0%
> 
> [ crash 2, on secondary server ]
> 2005-03-21 06:19:41: Auto-save of retention data completed successfully.
> 2005-03-21 06:28:11: SERVICE ALERT: foo-host;PING;WARNING;SOFT;1;WARNING - x.x.x.x: rta 234.926ms, lost 0%
> 2005-03-21 06:29:11: SERVICE ALERT: foo-host;PING;OK;SOFT;2;OK - x.x.x.x: rta 0.150ms, lost 0%
> 
> 
> Note the "PING;OK;SOFT;2" part. These are the last two log entries
> before the crash (it's the same host both times, actually) on both
> servers. The host check command is standard and there are no problems
> with it.
> 
> It's worth pointing out that this isn't the latest CVS, but rather
> whichever one was latest on Jan 19 2005. I haven't seen a checkin that
> touches this section of code though, so I believe the bug might still
> be lurking in there somewhere.
> 
> The coredumps for these crashes are largely useless. The backtrace
> points to __libc_malloc() called from service_result_worker_thread(),
> the thread started by pthread_create(). The thread function is called
> with a NULL argument, and the crash actually takes place at address
> 0x0.
> 
> Here's some of the gdb output (I still have binaries and several
> core-files in case anyone's interested in running more commands).
> 
> [ gdb session, core1 ]
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /lib/libm.so.6...done.
> Loaded symbols for /lib/libm.so.6
> Reading symbols from /lib/libnsl.so.1...done.
> Loaded symbols for /lib/libnsl.so.1
> Reading symbols from /lib/libpthread.so.0...done.
> Loaded symbols for /lib/libpthread.so.0
> Reading symbols from /lib/libc.so.6...done.
> Loaded symbols for /lib/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> #0  0x00000000 in ?? ()
> (gdb) bt
> #0  0x00000000 in ?? ()
> #1  0x001c100b in __libc_malloc (bytes=512) at malloc.c:2695
> #2  0x080612fe in service_result_worker_thread (arg=0x0) at utils.c:4692
> #3  0x00162de2 in pthread_start_thread (arg=0xbf5ffe40) at manager.c:241
> #4  0x0020f70a in thread_start () from /lib/libc.so.6
> (gdb)
> [ end gdb session, core1 ]
> 
> The gdb session for core2 is identical.
> 
> I'll investigate some more during the holidays and see if I can come
> up with a patch for this or at least some means of debugging it a bit
> more easily.
> 
> -- 
> Andreas Ericsson                   andreas.ericsson at op5.se
> OP5 AB                             www.op5.se
> Lead Developer
> 
> 
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
> 
> 



Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org
