Nagios blocking/stalling: Thread issue? v 2.0b3 or 2.0rc2

Andreas Ericsson ae at op5.se
Thu Jan 12 13:47:13 CET 2006


Ben Miller wrote:
> My latest tests and findings:
> 
> a) disable embedded perl and perl-cache
> 
> I did this and the results were exactly the same as before
> 
> b) move /home to local volume
> 
> Again, the results are the same as before, no improvement.
> 
> c) get cvs version of nagios and try it (with the above two changes in
> place) If it works, I will reverse b then a and see where/if it breaks.
> 
> I downloaded the snapshot and still the behavior is the same as
> originally described.
> 
> During these tests I observed the following behavior.
> The threading seems to start up OK and I see the proper number of checks
> occurring.  I have a lot that are snmp checks.  When the first
> check_ping process starts I see the following process tree and slowly
> the other checking threads die off until only one thread remains.  The
> remaining thread is the check_ping thread.  When it finally completes,
> only one check at a time is performed from then on.  This seems to
> support your thought that a child process is blocking the parent somehow.
> 
> 29637 pts/1 Sl+ 0:00 \_ ../bin/nagios nagios.cfg
> 30056 pts/1 S   0:00    \_ ../bin/nagios nagios.cfg
> 30057 pts/1 S   0:00        \_ /home/nagios/nagios/libexec/check_ping -p 10 -H 192.168.10.10 -w 100:60% -c 600:100%
> 30058 pts/1 S   0:00            \_ /bin/ping -n -U -w 16 -c 10 192.168.10.10
> 

Perhaps you can try using check_icmp instead? check_icmp sends the pings
itself instead of running /bin/ping, so you would get rid of one set of
file descriptors, and the Nagios process in charge of the plugin would be
the pinging process's parent rather than its grandparent.
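
To illustrate the parent/grandparent difference, here is a minimal Python
sketch. It is not Nagios or plugin code: WORKER, WRAPPER and both function
names are hypothetical stand-ins. WORKER plays the program that does the
actual pinging, and WRAPPER plays a plugin like check_ping that launches
another program and relays its output through an extra pipe.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the program doing the real work.
# It only reports the PID of whichever process launched it.
WORKER = "import os; print(os.getppid())"

# Hypothetical stand-in for a wrapper plugin: it launches WORKER through
# its own pipe, then prints its own PID followed by WORKER's answer.
WRAPPER = f"""
import os, subprocess, sys
out = subprocess.run([sys.executable, "-c", {WORKER!r}],
                     capture_output=True, text=True, check=True).stdout
print(os.getpid(), out.strip())
"""


def run_python(code):
    """Run a snippet in a child interpreter and return its stdout."""
    result = subprocess.run([sys.executable, "-c", code],
                            capture_output=True, text=True, check=True)
    return result.stdout


def worker_parent_when_direct():
    """Launch WORKER directly: one pipe, and we are its parent."""
    return int(run_python(WORKER))


def worker_parent_when_wrapped():
    """Launch WORKER via WRAPPER: two pipes, and we are its grandparent.

    Returns (wrapper_pid, pid_the_worker_sees_as_parent).
    """
    wrapper_pid, worker_ppid = run_python(WRAPPER).split()
    return int(wrapper_pid), int(worker_ppid)


if __name__ == "__main__":
    print("our pid:               ", os.getpid())
    print("direct worker's parent:", worker_parent_when_direct())
    print("wrapper pid, worker's parent:", worker_parent_when_wrapped())
```

In the direct case the worker reports the launching process as its parent;
in the wrapped case it reports the wrapper, so the launcher only sees the
worker's output second-hand through one more layer of pipes.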

Unfortunately I accidentally firewalled myself out of oss.op5.se, or I
would have put up a package there for you to download with the latest
check_icmp inside. I can upload it tonight (+0100) when I get home from
work.

> I upgraded to the latest plugins and this behavior remains.  Somehow
> strace -f seems to handle the check_ping blockage and lets the app behave
> properly.


This is to be expected, since strace makes the program behave differently:
the traced processes are stopped at every system call, which changes their
timing.


> I am out of ideas of what to test next.  Does this evidence help?  What
> is the next step?
> 

Try replacing check_ping with check_icmp. If that doesn't work, this bug
needs to be found and fixed in Nagios, which is non-trivial to say the
least and nearly impossible without knowing what it is that breaks.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

