Nagios stop hangs in FUTEX_WAIT

Herbert Straub herbert at linuxhacker.at
Sun Feb 25 20:27:19 CET 2007


Ethan Galstad wrote:
> Strange.  I haven't heard reports of this happening before and I've
> never encountered this myself.  I run FC4 on my development box, but it's
> a 32-bit machine and it looks like you've got 64-bit hw.  Correct?  I'll
> try installing FC6 this weekend and see if I can replicate it.
>
> Has this always happened for you, or was there a recent update of some
> kind that caused this?  Also, how much time passed between using the
> init script to stop Nagios and the error message appearing?
>
>   
Today I upgraded the installed (and patched) Nagios 2.6 to Nagios 2.7
using the normal RPM packages (yum update nagios) and started Nagios.
After three minutes I tried to stop the nagios process with
/etc/init.d/nagios stop and saw:

[root at xen1 ~]# /etc/init.d/nagios stop
Stopping network monitor: nagios
Waiting for nagios to exit . . . . . . . . . . .
Warning - running nagios did not exit in time

and ps alxw:

1 100 4058 1 15 0 121428 22732 184466 Ssl ? 0:06 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
1 100 6942 4058 25 0 0 0 exit Z ? 0:00 [nagios] <defunct>

and strace -p 4058

[root at xen1 ~]# strace -p 4058
Process 4058 attached - interrupt to quit
futex(0x3663d49980, FUTEX_WAIT, 2, NULL <unfinished ...>
Process 4058 detached

I think this error only happens when a lot of checks are scheduled. I
created a lot of "virtual services" with this script:

#!/usr/bin/python
# Generate 5000 dummy hosts, each with one check_ssh service.

vh = open("virt_hosts.cfg", "w")
for i in range(5000):
    # Host definition
    vh.write("define host {\n")
    vh.write("use generic-host\n")
    vh.write("host_name generichost%s.localdomain\n" % i)
    vh.write("alias generichost%s.localdomain\n" % i)
    vh.write("}\n\n")
    # Service attached to that host
    vh.write("define service {\n")
    vh.write("use generic-service\n")
    vh.write("service_description generic-service%s\n" % i)
    vh.write("check_command check_ssh\n")
    vh.write("host_name generichost%s.localdomain\n" % i)
    vh.write("}\n\n")
vh.close()

I'm using PNP (not actually installed on the test machine) for the
service performance data (nagios.cfg):
service_perfdata_command=process-service-perfdata-pnp

and

define command{
command_name process-service-perfdata-pnp
command_line /usr/local/share/nagios2/eventhandlers/process_perfdata.pl
}


I set normal_check_interval to 1 minute in the generic service
definition. I wait around 2 or 3 minutes after startup and then try to
stop Nagios. I can reproduce this error nearly every time. Next I
downloaded the Nagios source RPM nagios-2.7-2.fc6.src.rpm from the
fedora.redhat.com site and built it:

rpmbuild -ba SPECS/nagios.spec

Then I ran yum remove nagios and installed the self-built nagios package
with rpm -ivh RPMS/x86_64/nagios-2.7-2.x86_64.rpm. Next: nagios -v
/etc/nagios/nagios.cfg

Checking services...
Checked 5156 services.
Checking hosts...
...

Total Warnings: 1
Total Errors: 0

and started Nagios with /etc/init.d/nagios start. After the first stop,
I saw the nagios process in FUTEX_WAIT. I then modified the nagios.spec
file:


diff -u SPECS/nagios.spec.2.7 SPECS/nagios.spec
--- SPECS/nagios.spec.2.7 2007-02-25 18:59:00.000000000 +0100
+++ SPECS/nagios.spec 2007-02-25 19:12:26.000000000 +0100
@@ -1,6 +1,6 @@
 Name: nagios
 Version: 2.7
-Release: 2%{?dist}
+Release: 2%{?dist}hs1
 Summary: Host/service/network monitoring program

 Group: Applications/System
@@ -10,6 +10,7 @@
 Source1: nagios.logrotate
 Source2: nagios.htaccess
 Patch0: nagios-initrd.patch
+Patch1: nagios-mutex-wait.patch
 BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)

 BuildRequires: gd-devel > 1.8, mailx
@@ -53,6 +54,7 @@
 %prep
 %setup -q
 %patch0 -p0
+%patch1 -p0

 %build
 ./configure \
@@ -161,6 +163,9 @@
 %{_includedir}/%{name}

 %changelog
+* Sun Feb 25 2007 Herbert Straub <herbert at linuxhacker.at> 2.7-2hs1
+- mutex patch
+
 * Tue Feb 06 2007 Mike McGrath <imlinux at gmail.com> 2.7-2
 - Upstream released 2.7


Next I ran rpm -ivh RPMS/x86_64/nagios-2.7-2hs1.x86_64.rpm and
/etc/init.d/nagios start, and issued the stop command after 3 minutes. I
tried this five times and there was no error stopping the process. The
source RPM contains the original nagios-2.7.tar.gz; I compared the
md5sum: d664d2785cdca3c5c8a3e84c033e8e6e. I'm testing this on a 64-bit
machine with Fedora Core 6 and kernel 2.6.19-1.2895.fc6xen. I know that
this problem also happens on a 32-bit SMP Fedora Core 4 machine without
a Xen kernel.

I could be wrong, but could calling syslog() in a signal handler be the
problem? Look at the following articles:

Very old, but it describes the same situation:
http://sourceware.org/ml/libc-hacker/2004-06/msg00046.html

Newer, and possibly the same situation:
http://www-gatago.com/comp/mail/imap/27579981.html

Regards
Herbert Straub
