Multi-Threaded Nagios keeps on truckin?

Steven D. Morrey smorrey at ldschurch.org
Wed Sep 9 22:59:28 CEST 2009


This is odd.
I started Nagios with my threading patch under GDB and ran it for an hour or so. Eventually I went to lunch and came back.
Looking at the log and the screen, everything looked normal, but I found this at about 2:30 of runtime:

[New Thread 1871707040 (LWP 27875)]
*** glibc detected *** /usr/local/nagios/bin/nagios: double free or corruption (fasttop): 0xb3b17710 ***
======= Backtrace: =========
/lib/libc.so.6[0xb7eb2db2]
/lib/libc.so.6(__libc_free+0x84)[0xb7eb4414]
/usr/local/nagios/bin/nagios(free_memory+0x1e3)[0x807af33]
/usr/local/nagios/bin/nagios(my_system+0x223)[0x80773af]
/usr/local/nagios/bin/nagios(run_host_check+0x2ea)[0x8059966]
/usr/local/nagios/bin/nagios(check_host+0x2db)[0x8058fa4]
/usr/local/nagios/bin/nagios(verify_route_to_host+0x2b)[0x8058904]
/usr/local/nagios/bin/nagios(reap_service_checks+0xac6)[0x8057884]
/lib/libpthread.so.0[0xb7f7b13b]
/lib/libc.so.6(__clone+0x5e)[0xb7f0cfbe]


I would have assumed that the application would have stopped at this point, but it appears to have just shaken it off and continued....
In fact it's still going. It continued as normal for what I estimate to be over an hour more, spawning 3, 4, even 5 threads to handle the service reaper. Finally, though, it appears that the check results buffer is no longer being filled by anything, because I see reaper threads being spawned and instantly exiting.
Stepping through the reaper process shows that the buffer is empty every time.
Since this is a DNX based setup we are talking about, it would appear that the DNX Collector has gone deaf, but I think it may be something else.
It's possible that the segfault occurred while holding the mutex for the results buffer, thereby preventing the DNX collector from writing to it. However, upon examination the mutex appears to be unlocked (__lock = 0).
I'm going to keep looking, but in the meantime if anyone has any other ideas on what I might want to check here, it would be very much appreciated.
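
For anyone wanting to repeat the mutex check from code rather than by reading __lock in the debugger, here is a minimal sketch using pthread_mutex_trylock. The helper name mutex_is_held is made up for illustration and is not part of Nagios or DNX. It assumes a default (non-recursive) mutex, and the answer is only a snapshot: another thread can take or release the lock right after the call returns.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper, not part of Nagios or DNX.
 * Returns 1 if the mutex was held at the moment of the call,
 * 0 if it was free, and -1 on any other error.
 * Assumes a default (non-recursive) mutex. */
int mutex_is_held(pthread_mutex_t *m)
{
    int rc = pthread_mutex_trylock(m);

    if (rc == 0) {
        /* We acquired it, so nobody held it; release it again at once. */
        pthread_mutex_unlock(m);
        return 0;
    }
    if (rc == EBUSY) {
        /* Somebody (possibly a thread that has since died) holds it. */
        return 1;
    }
    return -1;
}
```

If a check like this reported 1 for the results buffer mutex while no live thread was inside the critical section, that would point at a lock abandoned by a thread that died; a result of 0, as observed here, suggests looking elsewhere.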

I find it very puzzling that a segfault 30 or 40 minutes ago would only now cause the results buffer to remain empty. 
It makes me wonder if we even have a real correlation here or if something else is at play.

Sincerely,
Steve

P.S. The solution to the segfault issue itself was to make the free_memory function thread safe, which I have now done.
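
To illustrate the general idea only (this is not the actual patch), here is a minimal sketch of guarding a free against concurrent callers. The names shared_buffer, free_lock, allocate_shared_buffer and free_shared_buffer are hypothetical stand-ins for whatever global state free_memory releases. The two points are that the pointer is tested and freed under a mutex, and that it is set to NULL afterwards so a second caller skips the free instead of triggering a double free.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the global state that free_memory() releases. */
static char *shared_buffer = NULL;
static pthread_mutex_t free_lock = PTHREAD_MUTEX_INITIALIZER;

/* Allocate the shared buffer if it is not already allocated.
 * Returns 1 if this call allocated it, 0 otherwise. */
int allocate_shared_buffer(size_t size)
{
    int allocated = 0;

    pthread_mutex_lock(&free_lock);
    if (shared_buffer == NULL) {
        shared_buffer = malloc(size);
        allocated = (shared_buffer != NULL);
    }
    pthread_mutex_unlock(&free_lock);
    return allocated;
}

/* Free the shared buffer at most once, however many threads call this.
 * Returns 1 if this call freed it, 0 if it was already released. */
int free_shared_buffer(void)
{
    int freed = 0;

    pthread_mutex_lock(&free_lock);
    if (shared_buffer != NULL) {
        free(shared_buffer);
        shared_buffer = NULL; /* later callers see NULL and skip free() */
        freed = 1;
    }
    pthread_mutex_unlock(&free_lock);
    return freed;
}
```

With this shape, two threads racing into the cleanup path both return normally, and only one of them actually calls free().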








