Nagios stop hangs in FUTEX_WAIT

Ethan Galstad nagios at nagios.org
Thu Feb 22 22:01:37 CET 2007


Herbert Straub wrote:
> If I try to stop Nagios with /etc/init.d/nagios stop on Fedora Core 4/6
> with Nagios 2.4 and 2.7, I get the message:
> 
> Warning - running nagios did not exit in time
> 
> The nagios process hangs in a futex wait. Example:
> 
> [root@xen1 ~]# strace -p 11620
> Process 11620 attached - interrupt to quit
> futex(0x2aaaabf15980, FUTEX_WAIT, 2, NULL
> 
> This does not happen on every stop, but on 60% of the stop attempts. I
> built Nagios with debugging info, attached to the hanging process with
> gdb, and saw three threads with the following stack traces:
> 
> thread 1:
> 
>     #0  0x0000003663ad9298 in __lll_mutex_lock_wait () from /lib64/libc.so.6
>     #1  0x0000003663a730e8 in _L_lock_14830 () from /lib64/libc.so.6
>     #2  0x0000003663a723ab in realloc () from /lib64/libc.so.6
>     #3  0x0000003663a66224 in _IO_mem_finish () from /lib64/libc.so.6
>     #4  0x0000003663a5e2ef in fclose@@GLIBC_2.2.5 () from /lib64/libc.so.6
>     #5  0x0000003663ac9bf1 in __vsyslog_chk () from /lib64/libc.so.6
>     #6  0x0000003663aca120 in syslog () from /lib64/libc.so.6
>     #7  0x0000000000424227 in write_to_syslog (buffer=0x7fffa9aaaeb0 "Caught SIGTERM, shutting down...\n", data_type=64) at logging.c:229
>     #8  0x00000000004248c9 in write_to_all_logs (buffer=0x7fffa9aaaeb0 "Caught SIGTERM, shutting down...\n", data_type=64) at logging.c:123
>     #9  0x000000000042b09e in sighandler (sig=<value optimized out>) at utils.c:3410
>     #10 <signal handler called>
>     #11 0x0000003663a94809 in fork () from /lib64/libc.so.6
>     #12 0x000000000042f8b2 in my_system (cmd=0x7fffa9aac6b0 "/usr/local/share/nagios2/eventhandlers/process_perfdata.pl", timeout=5, early_timeout=0x7fffa9aacebc, exectime=0x7fffa9aaceb0, output=0x0, output_length=0) at utils.c:2699
>     #13 0x00000000004536a3 in xpddefault_run_service_performance_data_command (svc=0x14672c0) at ../xdata/xpddefault.c:469
>     #14 0x0000000000453729 in xpddefault_update_service_performance_data (svc=0x1200011) at ../xdata/xpddefault.c:400
>     #15 0x0000000000453305 in update_service_performance_data (svc=0x1200011) at perfdata.c:91
>     #16 0x0000000000413855 in reap_service_checks () at checks.c:1396
>     #17 0x0000000000421ad2 in handle_timed_event (event=0x778c30) at events.c:1254
>     #18 0x0000000000421e73 in event_execution_loop () at events.c:965
>     #19 0x000000000040efa7 in main (argc=<value optimized out>, argv=<value optimized out>, env=0x7fffa9aae280) at nagios.c:710
> 
> 
> thread 2:
> 
>     #0  0x0000003663ac4a36 in poll () from /lib64/libc.so.6
>     #1  0x0000000000429ace in service_result_worker_thread (arg=<value optimized out>) at utils.c:4775
>     #2  0x0000003664606305 in start_thread () from /lib64/libpthread.so.0
>     #3  0x0000003663acd50d in clone () from /lib64/libc.so.6
> 
> thread 3:
>     #0  0x0000003663ac6ac2 in select () from /lib64/libc.so.6
>     #1  0x000000000042996e in command_file_worker_thread (arg=<value optimized out>) at utils.c:4943
>     #2  0x0000003664606305 in start_thread () from /lib64/libpthread.so.0
>     #3  0x0000003663acd50d in clone () from /lib64/libc.so.6
> 
> Source part of thread 1:
>           else if(sig<16){
> 
>                 sigshutdown=TRUE;
> 
>                 sprintf(temp_buffer,"Caught SIG%s, shutting down...\n",sigs[sig]);
>           --->  write_to_all_logs(temp_buffer,NSLOG_PROCESS_INFO);
> 
> Source part of thread 2:
>         while(1){
> 
>                 /* should we shutdown? */
>                 pthread_testcancel();
> 
>                 /* wait for data to arrive */
>                 /* select seems to not work, so we have to use poll instead */
>                 pfd.fd=ipc_pipe[0];
>                 pfd.events=POLLIN;
>            ---> pollval=poll(&pfd,1,500);
> 
> Source part of thread 3:
>           while(1){
> 
>                 /* should we shutdown? */
>                 pthread_testcancel();
> 
>                 /**** POLL() AND SELECT() DON'T SEEM TO WORK ****/
>                 /* wait a bit */
>                 tv.tv_sec=0;
>                 tv.tv_usec=500000;
>          --->   select(0,NULL,NULL,NULL,&tv);
> 
>                 /* should we shutdown? */
> 
> 
> Next I removed the call to write_to_all_logs in the signal handler routine:
> 
>     --- base/utils.c.orig   2007-02-05 21:16:13.000000000 +0100
>     +++ base/utils.c        2007-02-05 21:11:02.000000000 +0100
>     @@ -3406,8 +3406,10 @@
> 
>                     sigshutdown=TRUE;
> 
>     +               /* Straub
>                     sprintf(temp_buffer,"Caught SIG%s, shutting down...\n",sigs[sig]);
>                     write_to_all_logs(temp_buffer,NSLOG_PROCESS_INFO);
>     +               */
> 
>      #ifdef DEBUG2
>                     printf("%s\n",temp_buffer);
> 
> 
> Now the Nagios stop works every time. My question: is this a known
> problem, a new one, or something that only happens on my system?
> 
> Regards
> Herbert Straub
> 

Strange.  I haven't heard reports of this happening before and I've 
never encountered this myself.  I run FC4 on my development box, but it's 
a 32-bit machine and it looks like you've got 64-bit hw.  Correct?  I'll 
try installing FC6 this weekend and see if I can replicate it.

Has this always happened for you, or was there a recent update of some 
kind that caused this?  Also, how much time passed between using the 
init script to stop Nagios and the error message appearing?
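
For what it's worth, the thread 1 backtrace suggests a likely cause: the
SIGTERM handler ran while the main thread was inside fork(), and syslog()
then called realloc(), which is waiting on a libc lock that the
interrupted fork() may already hold.  syslog() is not on the POSIX list
of async-signal-safe functions, so calling it from a handler can deadlock
in this way.  A minimal sketch of the usual workaround, in which the
handler only records the signal and the main loop does the logging, might
look like the following.  The names shutdown_signal, on_shutdown_signal,
install_shutdown_handler, check_shutdown and demo_shutdown are
hypothetical and not taken from the Nagios source, and fprintf() stands
in for write_to_all_logs().

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag; sig_atomic_t is the only type that is safe to
 * write from a signal handler and read from normal code. */
static volatile sig_atomic_t shutdown_signal = 0;

/* The handler only records which signal arrived.  It makes no calls to
 * syslog(), sprintf(), malloc() or anything else that may take a lock. */
static void on_shutdown_signal(int sig)
{
    shutdown_signal = sig;
}

/* Install the handler for SIGTERM.  Returns 0 on success, -1 on error. */
static int install_shutdown_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_shutdown_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGTERM, &sa, NULL);
}

/* Called from the main loop, outside signal context, where logging is
 * safe.  Returns the pending signal number, or 0 if none was caught. */
static int check_shutdown(void)
{
    int sig = shutdown_signal;

    if (sig != 0)
        fprintf(stderr, "Caught signal %d, shutting down...\n", sig);
    return sig;
}

/* Demonstration: send SIGTERM to ourselves, then poll the flag the way
 * an event loop would on its next iteration. */
static int demo_shutdown(void)
{
    int rc = install_shutdown_handler();

    assert(rc == 0);
    raise(SIGTERM);          /* returns after the handler has run */
    return check_shutdown();
}
```

In Nagios the existing handler already sets sigshutdown, so the
equivalent change would presumably be to write the log message from the
event loop once it notices that flag, rather than from the handler.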


Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org
