nagios 3.2.3 localtime deadlock

Andreas Ericsson ae at op5.se
Wed Oct 13 11:27:54 CEST 2010


On 10/12/2010 07:44 PM, Matthew Kent wrote:
> On Fri, Oct 8, 2010 at 9:08 AM, Matthew Kent<real.mkent at gmail.com>  wrote:
>> On Fri, Oct 8, 2010 at 3:52 AM, Andreas Ericsson<ae at op5.se>  wrote:
>>> On 10/07/2010 08:43 PM, Matthew Kent wrote:
>>>> Hello all,
>>>>
>>>
>>> Hey you. First of all, thanks for including a backtrace. That's really
>>> neat.
>>>
>>
>> Thanks for looking :)
>>
>>>> Setting up a new nagios 3.2.3 install and occasionally (once in 24
>>>> hours) I'm seeing a child deadlock when calling localtime() like so:
>>>>
>>>> (gdb) bt
>>>> #0  0x00000033d5edfade in __lll_lock_wait_private () from /lib64/libc.so.6
>>>> #1  0x00000033d5e8d1cd in _L_lock_1685 () from /lib64/libc.so.6
>>>> #2  0x00000033d5e8cf17 in __tz_convert () from /lib64/libc.so.6
>>>> #3  0x000000000043e23e in get_datetime_string (raw_time=<value
>>>> optimized out>, buffer=0x2aaab014feb0<incomplete sequence \350>,
>>>> buffer_length=48, type=0) at utils.c:1696
>>>> #4  0x0000000000430990 in grab_datetime_macro (macro_type=7, arg1=0x0,
>>>> arg2=0x0, output=0x6998f8) at ../common/macros.c:1533
>>>> #5  0x0000000000432cbf in grab_macrox_value (macro_type=-4, arg1=0x0,
>>>> arg2=0x0, output=0x6998f8, free_macro=0x2) at ../common/macros.c:1089
>>>> #6  0x0000000000433586 in set_macrox_environment_vars (set=1) at
>>>> ../common/macros.c:3166
>>>> #7  0x00000000004335bb in set_all_macro_environment_vars (set=1) at
>>>> ../common/macros.c:3134
>>>> #8  0x000000000041b4c3 in run_async_service_check (svc=0x8d62560,
>>>> check_options=<value optimized out>, latency=<value optimized out>,
>>>> scheduled_check=1, reschedule_check=1,
>>>>       time_is_valid=<value optimized out>, preferred_time=<value
>>>> optimized out>) at checks.c:658
>>>> #9  0x000000000041d56d in run_scheduled_service_check (svc=0x8d62560,
>>>> check_options=0, latency=0.68999999999999995) at checks.c:260
>>>> #10 0x000000000042a45a in handle_timed_event (event=0x2aaab011af30) at
>>>> events.c:1257
>>>> #11 0x000000000042abe6 in event_execution_loop () at events.c:1143
>>>> #12 0x0000000000413055 in main (argc=<value optimized out>,
>>>> argv=<value optimized out>, env=0x7fffa0670758) at nagios.c:850
>>>>
>>>> this leads to Nagios being completely frozen until I manually kill the child.
>>>>
>>>
>>>
>>> Looking at the glibc code, I see no possible way that a single thread
>>> can hold on to the lock in __tz_convert() for any extended period of
>>> time. What version of glibc are you using?
>>>
>>
>> glibc-2.5-49.el5_5.4.x86_64
>>
>>>> Some light Googling tells me this can happen with localtime in certain
>>>> cases, but I see no indication of other people with this issue in
>>>> Nagios.
>>>>
>>>
>>> Since this seems to happen in the codepath that exports macros as
>>> environment variables, I'd like to know if it happens if you turn
>>> that stuff off. Unless you really, really need it, it's a good idea
>>> to do that anyways, since computing a bazillion macros each time
>>> Nagios runs a check is quite expensive. Set
>>>
>>>   use_large_installation_tweaks=1
>>> or
>>>   enable_environment_macros=0
>>>
>>> in your nagios.cfg file.
>>>
>>> use_large_installation_tweaks=1 is a really good idea anyways unless
>>> you're running Nagios on Windows 95, where a process' used memory
>>> was never reclaimed by the system unless manually free()'d.
>>>
>>
>> Yeah we don't even use the environment variables. Thanks for all the info.
>>
>>>> It's a pretty standard Nagios install on CentOS 5.5 - except for the
>>>> fact I'm using the mk-livestatus event broker. We have a couple
>>>> thousand checks configured on a pretty aggressive interval.
>>>>
>>>
>>> First try disabling environment macros. Then try without the
>>> mk-livestatus module. Seeing it happen in a pristine Nagios would mean
>>> we don't need to speculate about where the problem happens.
>>
>> Good call, I'll disable the env macros and run it over the weekend,
>> then re-enable them with livestatus off for good measure and report
>> back here. We'll see what happens!
>>
> 
> Oops, this was originally supposed to go to the list.
> 
> For the record setting
> 
> enable_environment_macros=0
> 
> did indeed prevent the issue from reoccurring over a period of 72 hours.
> 
> Let me know if I can be of further assistance with this issue.
> 

Well, I'll have a patch ready before 2010-11-04, I hope. When that
happens, feel free to download Nagios from git.op5.org and use the
patched version there, with enable_environment_macros=1 set in the
config, although you should probably leave them disabled in production
since the option does require quite a lot of cycles for each check
you run.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
