Nagios blocking/stalling: Thread issue? v 2.0b3 or 2.0rc2

Ben Miller bgmiller at nframe.com
Thu Jan 12 03:34:51 CET 2006


Andreas,
Thank you for your insight!

> I'm not sure, but it's most likely due to one of two reasons;
>
> * A plugin that's being run is stuck in uninterruptable IO. This can
> happen when you're trying to check a partition residing on a network
> mounted media where the network connection for some reason is down. It
> can also happen under spurious circumstances where a process with higher
> priority is holding a lock on some resource that the plugin is trying to
> use.
>
> * There's a bug in Nagios causing it to hold a mutex in one of the
> parents' threads that isn't released before the child is spawned, so the
> child inherits the mutex but has no way of releasing it. I know for a
> fact that Nagios does things considered illegal for multithreaded
> programs after fork()'ing, so this might be it. It should work well
> under Linux with reasonably up-to-date libraries and kernel though, but...
>
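
The second scenario quoted above (a lock held by another thread at fork() time, inherited by a child that cannot release it) can be reproduced in miniature. The following is only an analogy in Python, not Nagios code: the function name child_sees_lock_held and its parameters are hypothetical, and it assumes a Unix-like system where os.fork() is available. The child uses a timeout so that it reports the stuck lock instead of hanging on it.

```python
import os
import threading
import time


def child_sees_lock_held(hold_seconds=1.0, child_timeout=0.2):
    """Fork while another thread holds a lock; report what the child sees."""
    lock = threading.Lock()
    holding = threading.Event()

    def holder():
        # Helper thread takes the lock and keeps it across the fork().
        with lock:
            holding.set()
            time.sleep(hold_seconds)

    worker = threading.Thread(target=holder)
    worker.start()
    holding.wait()  # make sure the lock is really held before forking

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here, so nobody can ever
        # release the copied lock. Use a timeout instead of blocking.
        os.close(read_fd)
        got_it = lock.acquire(timeout=child_timeout)
        os.write(write_fd, b"0" if got_it else b"1")
        os._exit(0)

    # Parent: collect the child's verdict and clean up.
    os.close(write_fd)
    verdict = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    worker.join()
    return verdict == b"1"


if __name__ == "__main__":
    print("child found lock held:", child_sees_lock_held())
```

Without the timeout the child would block indefinitely, which is the kind of stall described in this thread. Recent Python versions may print a warning about forking a multi-threaded process when this runs.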

I did leave out a valuable bit of information.  The /home directory
itself is nfs mounted on the box running nagios.  The nagios binaries
reside on the mount itself.  In light of your suggestion, my very next
test will be to copy /home locally and eliminate this variable.

However, I do not see any processes in the ps list that show as
uninterruptible or disk-wait.
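
A single ps snapshot can miss a process that is only briefly in disk-wait, so it may be worth polling. The sketch below is one hedged way to do that on Linux by reading /proc/<pid>/stat, where state D means uninterruptible sleep. The helper names parse_stat and uninterruptible_processes are made up for illustration.

```python
import os


def parse_stat(stat_text):
    """Return (pid, command, state) from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the first "(" and last ")".
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    command = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, command, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, command) for processes currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry was unreadable
        if state == "D":
            found.append((pid, command))
    return found


if __name__ == "__main__":
    for pid, command in uninterruptible_processes():
        print(pid, command)
```

Calling uninterruptible_processes() in a loop while Nagios is stalled would show whether any plugin or Nagios child ever enters state D.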

> What version of plugins are you running? Which check is running when it
> hangs?

Running plugins of: nagios-plugins-1.4
Typically the plugin that I see running is a check_ping.  However due to
the high number of retries and packets I have check_ping set to make, it
takes a good 30 seconds or more of pinging before it returns failure.

The hosts I am trying to hit are behind a firewall that drops my pings
so the host is seen as down.  I have done the same tests from a system
that does have permission to ping the hosts, but the problem still
exists, it is just not as obvious.  I wanted to work on a system that
showed the problem as obviously as possible when it was broken.

> So in essence it always happens when you run Nagios, no matter how you
> compiled it, but never when you're running it from strace?

The problem occurs no matter how I compile nagios, when running nagios
by itself.

The problem occurs when I run non-debugged nagios with "strace"

The problem is fixed when I run non-debugged nagios with "strace -f"
The problem is fixed when I run debugged nagios with "strace"

> Have you tried this with 2.0rc1 or 2.0rc2 ?
I have not tried these versions.

> Do you get any messages in the nagios.log saying something like:
> service_result_worker_thread: poll(): (text-rep of errno) ?

I see no messages like this at all in the nagios.log

> Are you going to do this upgrade or have you already done it? 
I have the old system running the exact same configs still in place

> Was the kernel compiled with a 64-bit compiler?
I assume so.  I am using standard 64-bit RedHat kernels

> Was glibc and the thread-library compiled with a 64-bit compiler?
I assume so.  I am using stock libraries distributed with RH.

> What versions of kernel, glibc and thread-library are you using?
Kernel: 2.6.9-22.0.1.ELsmp #1 SMP Tue Oct 18 18:39:02 EDT 2005 x86_64
x86_64 x86_64 GNU/Linux

Glibc: glibc-2.3.4-2.13
This is the information I have about pthread on my system
/lib64/tls/libpthread-2.3.4.so
/lib64/tls/libpthread.so.0
/lib64/libpthread-0.10.so
/lib64/libpthread.so.0

> What flavour of thread-library are you using (linux-threads or nptl)?
I don't know the answer to this.
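
One way to find out, assuming a glibc-based system: glibc exposes the threading implementation through the GNU_LIBPTHREAD_VERSION configuration string, which `getconf GNU_LIBPTHREAD_VERSION` prints from a shell. A small Python sketch of the same lookup follows. The function names pthread_flavour and is_nptl are hypothetical, and the lookup returns None where the name is not supported.

```python
import os


def pthread_flavour():
    """Return glibc's threading library description, or None if unavailable."""
    try:
        value = os.confstr("CS_GNU_LIBPTHREAD_VERSION")
    except (ValueError, OSError):
        return None  # name unknown on this platform or lookup failed
    return value or None


def is_nptl(description):
    """True if a description string such as the one above names NPTL."""
    return description is not None and description.upper().startswith("NPTL")


if __name__ == "__main__":
    flavour = pthread_flavour()
    print("thread library:", flavour)
    print("NPTL:", is_nptl(flavour))
```

The value reflects the library picked up by the process doing the lookup, so it should be run in the same environment that Nagios runs in.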

> Try disabling embedded perl. When embedded perl is enabled (particularly
> with caching), the routine Nagios goes through after the fork() call is
> quite frankly so thread-unsafe that it's a miracle that it works
> anywhere at all.

Ok, I will put this on my list of trials.

> A Heisenbug... Nasty stuff. Running things through strace unfortunately
> causes different rules to apply for signals and mutexes (strace reads
> the output of the child-process directly, so there is less locking going
> on), and since it runs a lot slower mutexes that would possibly have
> been held if it weren't for strace have time to be released prior to the
> fork() call.
Sigh... yup, as ugly as it comes.

> Homework one is to come up with answers for all those questions I asked.
Done

> Fix-attempt one is to try the newest release of Nagios available. In
> particular I think you'll need the patch I submitted 2005-05-05 (after
> 2.0b3 was released), which adds a couple of flag-macros that's supposed
> to alter the behaviour of the C pre-processor somewhat.
>
> Fix-attempt two is to try re-compiling with embedded perl and the
> perl-cache disabled.

I think I will try this order:
a) disable embedded perl and perl-cache
b) move /home to local volume
c) get cvs version of nagios and try it (with the above two changes in
place).
If it works, I will reverse b then a and see where/if it breaks.

> Keep us posted, will you?

Absolutely.  Thank you for your suggestions,
Ben





