coredumps in wobbly networks

Andreas Ericsson ae at op5.se
Thu Mar 24 12:32:11 CET 2005


Ahoy.

I've observed a series of most unfortunate SIGSEGVs in Nagios.
They appear to happen when a service check pops back to OK on its 
second attempt, after which Nagios crashes (see logs below).

Here are two separate sets of log entries leading up to the crash. They 
are taken from two separate Nagios instances on separate machines and, 
as the timestamps show, the crashes occurred at different times (the 
naglog program used to get human-readable time is available at 
http://oss.op5.se/nagios/naglog.c).
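
For readers without naglog at hand, here is a minimal sketch of the same 
idea. It is not the naglog.c source. It assumes the usual raw Nagios log 
layout of "[<epoch>] <message>", and the function name humanize_log_line 
is made up for illustration. It formats in UTC to keep the output 
reproducible, whereas a real tool would more likely use local time.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/*
 * Rewrite "[<epoch>] <message>" as "YYYY-MM-DD HH:MM:SS: <message>".
 * Returns 0 on success, -1 if the prefix is malformed or `out` is too small.
 * Hypothetical helper; uses gmtime(), so it is not thread-safe.
 */
int humanize_log_line(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);

    if (line[0] != '[')
        return -1;

    /* The timestamp must be followed by "] " and then the message. */
    const char *close = strchr(line, ']');
    if (close == NULL || close[1] != ' ')
        return -1;

    char *end;
    long epoch = strtol(line + 1, &end, 10);
    if (end == line + 1 || end != close)
        return -1;              /* no digits, or junk before ']' */

    time_t t = (time_t)epoch;
    struct tm *tm = gmtime(&t); /* UTC, assumption made for this sketch */
    if (tm == NULL)
        return -1;

    char stamp[32];
    if (strftime(stamp, sizeof(stamp), "%Y-%m-%d %H:%M:%S", tm) == 0)
        return -1;

    int n = snprintf(out, outlen, "%s: %s", stamp, close + 2);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Feeding each line of nagios.log through such a function yields entries 
in the same shape as the excerpts in this mail.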

[ crash 1, on primary server ]
2005-03-20 22:11:57: Auto-save of retention data completed successfully.
2005-03-20 22:25:56: SERVICE ALERT: foo-host;PING;WARNING;SOFT;1;WARNING - x.x.x.x: rta 107 ms, lost 0%
2005-03-20 22:26:56: SERVICE ALERT: foo-host;PING;OK;SOFT;2;OK - x.x.x.x: rta 1.82 ms, lost 0%

[ crash 2, on secondary server ]
2005-03-21 06:19:41: Auto-save of retention data completed successfully.
2005-03-21 06:28:11: SERVICE ALERT: foo-host;PING;WARNING;SOFT;1;WARNING - x.x.x.x: rta 234.926ms, lost 0%
2005-03-21 06:29:11: SERVICE ALERT: foo-host;PING;OK;SOFT;2;OK - x.x.x.x: rta 0.150ms, lost 0%


Note the "PING;OK;SOFT;2" part. On both servers these are the last two 
log entries before the crash (it's the same host both times, actually). 
The host check command is standard and there are no problems with it.
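
To hunt for the same pattern in other logs, the SERVICE ALERT message can 
be split into its semicolon-separated fields (host, service, state, 
state type, attempt, plugin output). The sketch below is only an 
illustration: the struct, the function names and the buffer sizes are 
all invented here and are not taken from the Nagios source.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one SERVICE ALERT entry. */
struct service_alert {
    char host[64];
    char service[64];
    char state[16];      /* OK, WARNING, CRITICAL, UNKNOWN */
    char state_type[8];  /* SOFT or HARD */
    int  attempt;
    char output[256];    /* plugin output, rest of the line */
};

/* Parse the message part of a log line. Returns 0 on success, -1 otherwise. */
int parse_service_alert(const char *msg, struct service_alert *sa)
{
    static const char prefix[] = "SERVICE ALERT: ";

    assert(msg != NULL && sa != NULL);

    if (strncmp(msg, prefix, sizeof(prefix) - 1) != 0)
        return -1;
    msg += sizeof(prefix) - 1;

    /* Field widths are one less than the buffer sizes above. */
    int n = sscanf(msg, "%63[^;];%63[^;];%15[^;];%7[^;];%d;%255[^\n]",
                   sa->host, sa->service, sa->state, sa->state_type,
                   &sa->attempt, sa->output);
    return n == 6 ? 0 : -1;
}

/* True for an "OK;SOFT" entry, i.e. a recovery before the HARD state. */
int is_soft_recovery(const struct service_alert *sa)
{
    return strcmp(sa->state, "OK") == 0 &&
           strcmp(sa->state_type, "SOFT") == 0;
}
```

Running is_soft_recovery() over every parsed entry and printing the ones 
that are immediately followed by a restart would show whether the pattern 
holds for other hosts too.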

It's worth pointing out that this isn't the latest CVS, but rather 
whichever one was latest on Jan 19 2005. I haven't seen a checkin that 
touches this section of the code though, so I believe the bug might 
still be lurking in there somewhere.

The coredumps for these crashes are largely useless. The backtrace 
points to __libc_malloc() called from service_result_worker_thread(), 
the thread started through pthread_create(). The thread function is 
called with a NULL argument, and the crash actually takes place at 
address 0x0.

Here's some of the gdb output (I still have binaries and several 
core-files in case anyone's interested in running more commands).

[ gdb session, core1 ]
Program terminated with signal 11, Segmentation fault.
Reading symbols from /lib/libm.so.6...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libnsl.so.1...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libpthread.so.0...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libnss_files.so.2...done.
Loaded symbols for /lib/libnss_files.so.2
#0  0x00000000 in ?? ()
(gdb) bt
#0  0x00000000 in ?? ()
#1  0x001c100b in __libc_malloc (bytes=512) at malloc.c:2695
#2  0x080612fe in service_result_worker_thread (arg=0x0) at utils.c:4692
#3  0x00162de2 in pthread_start_thread (arg=0xbf5ffe40) at manager.c:241
#4  0x0020f70a in thread_start () from /lib/libc.so.6
(gdb)
[ end gdb session, core1 ]

The gdb session for core2 is identical.
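
For what it's worth, a frame #0 at 0x00000000 is what gdb shows when 
execution has jumped through a null (or zeroed-out) function pointer. 
The sketch below is not Nagios code and is not a diagnosis of this bug. 
It only reproduces the symptom in a forked child so the parent can 
report which signal killed it. All names in it are made up.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef void (*callback_fn)(void);

/*
 * Fork a child that calls through a NULL function pointer.
 * Returns the signal that killed the child, 0 if it exited normally,
 * or -1 if fork()/waitpid() failed.
 */
int signal_from_null_call(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* volatile keeps the compiler from optimising the call away */
        callback_fn volatile fn = NULL;
        fn();          /* jumps to address 0x0 */
        _exit(0);      /* not reached if the call faults */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}

/* Usage: on Linux the child is expected to die from SIGSEGV. */
void expect_segv_from_null_call(void)
{
    assert(signal_from_null_call() == SIGSEGV);
}
```

A core from that child has the same "#0 0x00000000 in ?? ()" top frame, 
with the caller in frame #1.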

I'll investigate some more during the holidays and see if I can come up 
with a patch for this or at least some means of debugging it a bit more 
easily.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer

