[Nagios-users] external commands and segfault -- again

Ethan Galstad nagios at nagios.org
Mon Jan 8 23:33:42 CET 2007


Andreas Ericsson wrote:
> bobi at netshel.net wrote:
>> Hey Fellow Nagios-ites:
>>
>> I've been having this *exact* same segfault problem for the last couple o'
>> months.
>>
>> And, after looking at David's stack trace output, it is segfaulting for
>> him in the exact same way/place as it is for me.
>>
>> Here's what I've found:
>>
>> The core dumps that I've examined are all segfaulting when handling the
>> expiration of a scheduled downtime.
>>
>> Since David's stack trace looks identical to mine, I don't think it is in
>> the external command processing, as he believes, but it is in the downtime
>> expiration handling, as well.
>>
>> Having examined about a dozen of these identical core dumps, I see that it
>> is a corruption of the entire scheduled_downtime structure that is being
>> passed into the handle_scheduled_downtime() function.
>>
>> The handle_scheduled_downtime() function is being invoked by the high
>> priority event processing logic in the event_execution_loop().  So it
>> pulls an EVENT_SCHEDULED_DOWNTIME timed_event structure off of the high
>> priority event list, and then hands it to handle_timed_event(), which in
>> turn invokes the handle_scheduled_downtime() routine to handle the
>> expiration of the specified downtime event.
>>
>> The problem is, the scheduled_downtime structure is already corrupted
>> while sitting in the high_priority list - well before it is dequeued by
>> the event_execution_loop() logic.
>>
>> I've walked the high priority list in memory with gdb to examine other
>> timed_event structures and have noticed that only the scheduled_downtime
>> structures associated with EVENT_SCHEDULED_DOWNTIME timed events are
>> affected by the memory corruption.  In fact, one time, I found nine
>> scheduled downtime expiration events sequentially listed in the high
>> priority list and the first three had their scheduled_downtime structures
>> corrupted and the remaining six were in pristine condition.
>>
>>
>> So, I've narrowed it down to a couple of possibilities (feel free to add
>> your own!):
>>
>> 1. The scheduled_downtime structure is already corrupted when it is being
>> added to the high priority timed event scheduling list, or
>>
>>
>> 2. The scheduled_downtime structure is OK when it is added to the high
>> priority list, but perhaps a bad pointer access is overwriting it with
>> garbage at some other point in the program.  This might be somewhat
>> painful to track down.
>>
>>
>> Of the two, I suspect that the second one is the more likely candidate.
>>
> 
> I think the first, as it only happens with scheduled downtime stuff. 
> Otherwise you'd see it on other high-prio events as well (unless you're 
> extremely unlucky each time the crash happens).
> 
>> Some other notes:
>>
>> 1. The timed event expirations that segfault Nagios seem to be "randomly"
>> chosen.
>>
>> We have some regularly submitted (via cron) scheduled downtimes that will
>> work fine for weeks, and then one of them will come up for expiration and
>> trigger this scheduled-downtime-expiration bug.  I've also seen it happen
>> with ad-hoc scheduled downtime submissions via the CGI interface.
>>
>> I've seen it happen with "regular" scheduled downtimes as well as the new
>> "triggered" scheduled downtime.  We thought it might have been related to
>> the new triggered downtime, since that was one of the first events causing
>> a segfault.  But then after eliminating the use of triggered downtimes
>> altogether, the segfaults still occur with the regular scheduled downtime
>> expirations.
>>
>> 2. I've had this problem with Nagios 2.4, 2.5 and 2.6.  So, "upgrading"
>> hasn't gotten rid of it.
>>
>> 3. We are currently running Nagios 2.6 on a 64-bit Linux platform: SLES-9
>> x86-64, Kernel 2.6.5-7.267-smp
>>
> 
> This is the culprit, I guess. As this isn't a widespread problem, I 
> wouldn't be surprised if it's related to 64-bit archs (kernel-2.6.5 is 
> fairly ancient too, but that shouldn't matter as this is the only app 
> you're seeing it in).
> 
> I'm guessing this actually is an SMP-system and that SuSE doesn't 
> install SMP kernels on all systems, correct? If so, this could also be a 
> source of problem for you. Nagios doesn't follow the pthread guidelines 
> very closely and does some pretty inappropriate things post-fork() for 
> being a threaded application. This could be one of those problems that 
> doesn't happen on single-cpu systems because the only cpu doesn't have 
> anything to compete with when racing for the memory.
> 
> 
>> 4. We don't have any other segfault problems with other apps on this
>> system.
>>
>>
>> So I'm still trying to figure out *what* is overwriting the
>> scheduled_downtime structures with garbage in memory.
>>
>> Any ideas, based upon this additional information?
>>
> 
> Upgrade glibc and the kernel and pray. Other than that, I guess running 
> it in valgrind and/or gdb for a long period of time or chucking 
> assert()'s and printf()'s at the Nagios code and seeing where it breaks 
> is the only solution.
> 
> 
> btw, thanks for the nicely detailed problem report.
> 
> 

Hmmmm... this is not good.  I just looked through the source code and 
found a bug that looks like it could be the cause of the problem. There 
are actually two potential segfault scenarios that I found, and they have 
been around for a long time...

1. If a scheduled downtime entry is manually deleted/cancelled, the 
corresponding event in the event queue is not removed.  The event item 
still contains a pointer to the (now deleted) downtime entry.  This can 
cause a segfault.

2. Another code segment in downtime.c deleted a downtime entry and then 
referenced it afterwards, when Nagios searched through the other downtime 
entries to see if they were triggered by the original (deleted) 
downtime.  Why this hasn't caused segfaults every time a downtime entry 
is deleted is beyond me.
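
To make the second scenario concrete, here is a minimal sketch of a delete 
routine that avoids touching the entry after it has been freed.  This is 
not the actual downtime.c code: the struct layout, the list handling and 
the function names are all hypothetical, and removing the triggered 
entries recursively is an assumption made for the illustration.  The 
point is only that the id is copied by value before the free(), so the 
later search never reads freed memory.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified downtime entry; not the real Nagios struct. */
typedef struct downtime {
    unsigned long id;
    unsigned long triggered_by;   /* id of the triggering downtime, 0 if none */
    struct downtime *next;
} downtime;

static downtime *downtime_list = NULL;

/* Prepend a new entry to the list (illustrative helper). */
static downtime *add_downtime(unsigned long id, unsigned long triggered_by) {
    downtime *d;
    assert(id != 0);              /* 0 is reserved to mean "not triggered" */
    d = malloc(sizeof *d);
    if (d == NULL)
        return NULL;
    d->id = id;
    d->triggered_by = triggered_by;
    d->next = downtime_list;
    downtime_list = d;
    return d;
}

/* Unlink and free the entry with the given id; returns 1 if it was found. */
static int unlink_and_free(unsigned long id) {
    downtime **pp = &downtime_list;
    while (*pp != NULL) {
        if ((*pp)->id == id) {
            downtime *dead = *pp;
            *pp = dead->next;
            free(dead);           /* 'dead' must not be read after this */
            return 1;
        }
        pp = &(*pp)->next;
    }
    return 0;
}

/* Delete a downtime and everything it triggered; returns entries removed. */
static int delete_downtime(unsigned long id) {
    int removed = unlink_and_free(id);
    downtime *d;

    if (!removed)
        return 0;

    /* Search using the saved id value, never the freed struct.  The scan
       restarts after each recursive delete because the list has changed. */
    for (;;) {
        for (d = downtime_list; d != NULL; d = d->next)
            if (d->triggered_by == id)
                break;
        if (d == NULL)
            break;
        removed += delete_downtime(d->id);
    }
    return removed;
}
```

The buggy pattern would be to read the triggering id out of the freed 
struct while walking the list; that may appear to work for a long time if 
the freed memory has not been reused yet.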

At any rate, I have just posted a patch to the 2.x branch of CVS.  The 
patch changes the way scheduled downtime is referenced from the event 
queue.  Instead of storing a pointer to the downtime data struct, the 
event now stores the downtime id number.  The timed event handler will 
search for a downtime entry matching the id before it does anything.  If 
the downtime was already deleted, it's okay.  Give it a try and see if 
things improve.
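
For readers who want the idea without reading the patch, here is a rough 
sketch of the id-based lookup.  It is not the patch itself: the struct 
fields and the names find_downtime() and handle_downtime_event() are 
hypothetical and much simpler than the real code.  The event carries only 
an id, and the handler looks the entry up before using it, so an event 
that outlives its downtime becomes a harmless no-op.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; not the actual Nagios structs. */
typedef struct downtime {
    unsigned long id;
    int in_effect;
    struct downtime *next;
} downtime;

typedef struct timed_event {
    unsigned long downtime_id;    /* an id, not a pointer to the downtime */
} timed_event;

/* Return the entry with the given id, or NULL if it no longer exists. */
static downtime *find_downtime(downtime *list, unsigned long id) {
    for (; list != NULL; list = list->next)
        if (list->id == id)
            return list;
    return NULL;
}

/* Returns 1 if the downtime was found and expired, 0 if it was already gone. */
static int handle_downtime_event(downtime *list, const timed_event *ev) {
    downtime *d;
    assert(ev != NULL);
    d = find_downtime(list, ev->downtime_id);
    if (d == NULL)
        return 0;                 /* downtime was deleted earlier: nothing to do */
    d->in_effect = 0;             /* stand-in for the real expiration logic */
    return 1;
}
```

The trade-off in this sketch is a linear search per event in exchange for 
never dereferencing a stale pointer.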

Unfortunately, this patch will now break the ndoutils addon (yesterday's 
release, as well as earlier revisions).  I'll get a patch in CVS shortly 
to fix this.  Thanks for the great problem description!



Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org
