Nagios blocking/stalling: Thread issue? v 2.0b3 or 2.0rc2

Andreas Ericsson ae at op5.se
Thu Jan 12 17:06:27 CET 2006



Ben Miller wrote:
> Andreas, 
> Thanks again for your suggestions.
> 
> 
>>Perhaps you can try using check_icmp instead? That way you would get rid
>>of one set of file-descriptors, and the Nagios process in charge of the
>>plugin will be the process' parent rather than its grandparent.
> 
> 
> I tried check_icmp from the standard pack of plugins and it has
> resolved the problem for me.  For whatever reason, check_icmp does not
> cause the same issue that check_ping does.  It is very clear that
> check_ping somehow "clogs up" (highly technical description) the
> threading and allows Nagios to only use a single thread to do internal
> processing and wait on checks/alerts.
> 
> I use lots of check_snmp commands that spawn snmpget where the
> grandfather relationship works. So I don't know why check_ping breaks
> things.
> 

Perhaps because /bin/ping is a setuid program. check_icmp is too, but it 
drops its inherited privileges immediately after getting the raw socket 
and reclaims the same uid as it had before getting its elevated 
privileges (i.e. the same uid as Nagios has). This means that Nagios can 
slay the process if it has to, which isn't necessarily the case with 
/bin/ping since that's executed from another program. I'm not sure 
though, and others seem to have no problem with running the check_ping 
plugin.
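
To illustrate the privilege handling described above, here is a minimal sketch in C. It is not the actual check_icmp source; the function names drop_privileges() and open_icmp_socket_and_drop() are made up for this example. It only shows the general pattern: acquire the raw socket while the elevated privileges are still in effect, then immediately return to the real uid/gid of the invoking user.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: give up any setuid/setgid privileges.
 * Returns 0 on success, -1 on failure. */
int drop_privileges(void)
{
    /* Drop the group first; once the uid is dropped we may no longer
     * be allowed to change the gid. */
    if (setgid(getgid()) != 0)
        return -1;
    if (setuid(getuid()) != 0)
        return -1;

    /* Verify that the effective ids now match the real ones. */
    if (geteuid() != getuid() || getegid() != getgid())
        return -1;
    return 0;
}

/* Hypothetical helper: open a raw ICMP socket (the only step that
 * needs elevated privileges), then drop privileges right away.
 * Returns the socket, or -1 if the socket could not be opened or
 * the privileges could not be dropped. */
int open_icmp_socket_and_drop(void)
{
    int sock = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);

    /* Drop privileges whether or not socket() succeeded. */
    if (drop_privileges() != 0) {
        if (sock >= 0)
            close(sock);
        return -1;
    }
    return sock;
}
```

After open_icmp_socket_and_drop() returns, the process runs with the same uid as whoever started it, which is the property that matters for the parent being able to signal it.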

> 
> While I have a solution that allows me to move my project forward, thank
> you very much!!, I am concerned that there is something as simple as a
> check that can cripple the Nagios process.  Is there someone who is
> interested in trying to determine if this is an anomaly or easily
> replicable in other environments as well?
> 

Umm.. yes and no. I'd love to get the bug fixed, but I wouldn't want to 
get bit by it. Considering I wrote check_icmp and won't be using 
check_ping ever again I don't think I'm at risk though.


> From my tests it would seem that this problem should be easy to
> replicate for testing purposes.  My only recommendation in testing it is
> that you ping something that doesn't exist with a high packet count.
> This is what I have done.
> 

Yes, but others have done that without check_ping locking up Nagios, so 
it's not as easy as that. I believe one way to resolve this issue is to 
move from the current popen() based way of executing programs to a more 
multi-tasking friendly multiplexing variant that handles 
file-descriptors rather than FILE streams and thus bypasses some of the 
mutex locking. In the long run it could be used to do away with threads 
completely. That's a very invasive change and anything but trivial 
though, so don't hold your breath for it.
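
As a rough idea of what a file-descriptor based replacement for popen() could look like, here is a minimal sketch in C. It is not Nagios code and not a proposed patch; run_check() and its parameters are hypothetical names for this example. It forks the plugin as a direct child, reads its stdout from a plain pipe descriptor using poll(), and kills the child if no output or EOF arrives within the timeout. For simplicity the timeout applies to each wait rather than to the whole run, and only one child is handled; a real multiplexing variant would poll the descriptors of many running checks in a single loop.

```c
#include <errno.h>
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] as a direct child and capture up to
 * buflen-1 bytes of its stdout. Returns the child's exit code, or -1 on
 * error, timeout, or abnormal termination. */
int run_check(char *const argv[], char *buf, size_t buflen, int timeout_ms)
{
    int fds[2], status, timed_out = 0;
    size_t used = 0;
    pid_t pid;

    if (buflen == 0 || pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: send stdout into the pipe and exec the plugin. */
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    /* Parent: read from the raw descriptor, no FILE stream involved. */
    close(fds[1]);
    struct pollfd pfd = { .fd = fds[0], .events = POLLIN };
    for (;;) {
        char scratch[256];
        ssize_t got;
        int n = poll(&pfd, 1, timeout_ms);

        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0) { /* timeout, or poll() failed */
            timed_out = 1;
            break;
        }
        if (used < buflen - 1) {
            got = read(fds[0], buf + used, buflen - 1 - used);
            if (got > 0)
                used += (size_t)got;
        } else {
            /* Buffer full: keep draining so the child never blocks. */
            got = read(fds[0], scratch, sizeof scratch);
        }
        if (got <= 0) /* EOF or read error */
            break;
    }
    buf[used] = '\0';
    close(fds[0]);

    /* The plugin is our own child, so we can signal and reap it. */
    if (timed_out)
        kill(pid, SIGKILL);
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    if (timed_out || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Calling run_check() with an argv such as { "echo", "hello", NULL } fills the buffer with the plugin output and returns its exit code; a plugin that stays silent past the timeout is killed and reported as -1.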

> Thanks again for all your help and discussion that led to a workable
> solution.
> 
> Please let me know if I can help in hunting down the root cause if this
> is indeed replicable on other systems.
> 

You're welcome, and the info you've provided so far has already been 
helpful. Thanks.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

