Nagios blocking/stalling: Thread issue? v 2.0b3 or 2.0rc2

Ben Miller bgmiller at nframe.com
Wed Jan 11 23:36:45 CET 2006


Greetings,
I am seeing strange behavior with Nagios that appears to be a
threading issue.  I have troubleshot this enough to conclude that it
may be over my head and may have to do with how threads are handled in
Nagios or the libraries it uses.  I believe this to be a code-level
issue, so I am posting to the devel list rather than the user list.
Please forgive me if this is the wrong place.

-Symptom
When I run Nagios it takes about 30-60 seconds to load saved state
information such as scheduled downtimes, and it takes anywhere from
60 to 120 seconds to process external commands.  In addition, the check
queue stacks up because only one check is processed at a time.  A ps
shows ONLY the main Nagios process, a single child, and that child
spawning the check command.  It appears that nothing else (external
commands, notifications, etc.) is processed while the one child
task is working.

During troubleshooting, I ran Nagios under strace to determine what it
was blocking on, and I can clearly see that it stops in a
"wait4(" call on the pid of the checking or alerting child.
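
To illustrate the symptom (this is NOT Nagios code, only a minimal
sketch under assumptions), the following Python script contrasts a
parent that blocks in a wait on each child with one that polls its
children using WNOHANG.  The function names, the child count, and the
sleep that stands in for a check command are all hypothetical.

```python
import os
import time


def spawn_check(duration):
    """Fork a child that stands in for one check command; return its pid."""
    pid = os.fork()
    if pid == 0:
        # Child: pretend to run a check, then exit without cleanup handlers.
        time.sleep(duration)
        os._exit(0)
    return pid


def run_serial(n, duration):
    """Start n children one at a time, blocking on each before the next."""
    start = time.monotonic()
    for _ in range(n):
        pid = spawn_check(duration)
        os.waitpid(pid, 0)  # blocks here, like the wait4() seen in strace
    return time.monotonic() - start


def run_overlapped(n, duration):
    """Start n children up front, then reap them without blocking."""
    start = time.monotonic()
    pending = {spawn_check(duration) for _ in range(n)}
    while pending:
        for pid in list(pending):
            done, _status = os.waitpid(pid, os.WNOHANG)
            if done == pid:  # (0, 0) is returned while the child still runs
                pending.discard(pid)
        time.sleep(0.01)  # the parent is free to do other work here
    return time.monotonic() - start


if __name__ == "__main__":
    print("serial:     %.2fs" % run_serial(4, 0.3))
    print("overlapped: %.2fs" % run_overlapped(4, 0.3))
```

In the serial variant a ps would show at most one child at any moment,
which matches what I observe; the overlapped variant is what I would
expect to see on a healthy system.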

I then ran strace -f on Nagios to see the full thread flow of what was
happening, and Nagios performed perfectly.  The problem went away:
external checks were processed in a few seconds and ps showed a list of
half a dozen or so check or alert child processes.

In addition, when I compiled with all debugging turned on and ran Nagios
by itself, the bad behavior was back.  However, when I ran the debug
executable through strace (with NO -f), the process started up
excruciatingly slowly but then ran properly, with multiple child
processes and external commands handled correctly.

The problem occurs consistently and is easy to replicate.  I have
reproduced it with both 2.0b3 and 2.0rc2.

-Background
I have been running the same Nagios version on a different box with
the exact same compile options and config files for months, and
everything works fine there.  I am upgrading from an AMD 32-bit system
(RedHat Enterprise v4) to a new box with dual 64-bit Opterons
(RedHat Enterprise v4 64-bit).
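
Since the main difference between the two boxes is word size and CPU
count, a small script like the one below could be run on both to
record the differences side by side.  It is only a sketch: the
function name host_summary is hypothetical, and the pointer size it
reports is that of the Python interpreter build, not necessarily of
the kernel or of the Nagios binary.

```python
import os
import platform
import struct


def host_summary():
    """Collect a few facts that differ between a 32-bit and a 64-bit box."""
    return {
        "machine": platform.machine(),             # e.g. i686 or x86_64
        "pointer_bits": struct.calcsize("P") * 8,  # of this interpreter build
        "cpu_count": os.cpu_count(),               # may be None if unknown
        "libc": " ".join(platform.libc_ver()),     # C library name and version
    }


if __name__ == "__main__":
    for key, value in host_summary().items():
        print("%s: %s" % (key, value))
```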

I compile with:

  ./configure --prefix=/home/nagios/nagios \
      --with-cgiurl=/nagios/cgi-bin --with-nagios-user=nagios \
      --with-nagios-group=nagios --with-htmurl=/nagios --with-perlcache \
      --enable-embedded-perl

It seems that there might be a thread/race/timing issue that is relieved
when there is enough debugging or if strace is involved in the thread
handling.

I can provide more information to anyone who can help me resolve this
issue.
Thank you in advance.
Ben



