Growing number of orphaned service checks...

Charles Dee Rice cdrice at pobox.com
Thu Mar 3 19:46:59 CET 2005


Andreas Ericsson wrote:
> What's your plugin_timeout value? It should take care of killing runaway 

Do you mean service_check_timeout?  I have mine set to 60 seconds.  Each 
plugin I call that accepts a timeout argument is also invoked with a 
60-second timeout (e.g. "$USER1$/check_ssh -t 60 $HOSTADDRESS$").
I do not see any messages in my log files indicating that plugins are 
timing out, nor do any service checks go to "unknown" states because of 
timeouts.
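For reference, here is roughly what the relevant pieces look like on my 
side (the file names are just the sample-config defaults from the source 
install, and the command definition is one example of many):

    # nagios.cfg -- main configuration file
    service_check_timeout=60

    # checkcommands.cfg -- example command definition with an explicit timeout
    define command{
            command_name    check_ssh
            command_line    $USER1$/check_ssh -t 60 $HOSTADDRESS$
            }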

> plugins. This might fail in case the plugin is running as a different 
> user than the nagios process though. No +s bits anywhere in the path to 
> or on your plugins?

Everything is running as nagios:nagios; all executables are owned by 
nagios:nagios, and there are no suid/sgid bits set anywhere from / to 
the executable path, nor on the executables themselves.
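For what it's worth, this is roughly how I checked (assuming the standard 
/usr/local/nagios install prefix, which is what I'm using):

    # look for any setuid/setgid files under the nagios install tree
    find /usr/local/nagios -type f \( -perm -4000 -o -perm -2000 \) -ls
    # and confirm ownership of the plugins themselves
    ls -l /usr/local/nagios/libexec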

I have seen plugin timeouts happen, in very specific scenarios unrelated to 
this issue; those problems were explainable and corrected at the time.  

There appears to be something special or different about this case that 
prevents Nagios from detecting that the plugins are timing out.

> Definitely not good. This might be due to several master instances 
> running simultaneously (an excess master might then reap the check 
> results of the actual master process through the waitall() syscall, 
> causing the real master never to see the result of the checks). What 
> happens if you killall -9 nagios, clean up the garbage and then restart 
> it properly from the init script?

Tried that just now.  Same behaviour.
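In case it matters, this is the sequence I used (the lock file and command 
pipe locations are the defaults from a source install, so adjust as needed):

    /etc/init.d/nagios stop                    # normal stop first
    killall -9 nagios                          # force-kill anything left over
    ps -C nagios                               # verify no nagios processes remain
    rm -f /usr/local/nagios/var/nagios.lock    # clear a stale lock file, if any
    rm -f /usr/local/nagios/var/rw/nagios.cmd  # and the external command pipe
    /etc/init.d/nagios start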

> This isn't a load issue, so don't worry about it.

I didn't necessarily think it was specifically related to load, but perhaps 
to some other system resource that is somehow not allowing Nagios to really 
"start" a service check, even though it thinks it kicked one off and the 
process entry shows up in the process table -- or that is causing some race 
condition that prevents the service check from ever completing.
I don't know the down-and-dirty details of how Nagios manages its service
check calls, so perhaps the kinds of race conditions I'm fearing aren't
even possible.
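What I can do in the meantime is watch the process table for check 
processes that hang around; something along these lines should work 
(the grep pattern assumes all the plugins live under libexec):

    # snapshot plugin processes -- anything with a large elapsed time
    # is a candidate for an orphaned or hung check
    ps -eo pid,ppid,etime,args | grep '[l]ibexec'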

> If you feel like it, you could put all of the config up for browsing. 
> Make heavy use of sed to obscure sensitive data, like so;
> sed 's/\(address[\t ]*\).*/\1xxx.xxx.xxx.xxx/' object.cfg > object.cfg.stripped

I was considering that, but decided it would be a lot of work with all the 
internal name and address replacements.  :)  I'll see if I can find time to 
sanitize the files so they don't reveal anything internal to our network 
and configuration here, and see if I can put them somewhere.  I would hope
this isn't really a configuration problem, though -- it "feels like" I 
shouldn't be able to mis-configure Nagios into this kind of state.  This 
smells more like a system resource issue or a process management bug.  Maybe.
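If I do get to it, I'd probably just loop your sed scrub over everything, 
roughly like this (directory path assumed from my install):

    # scrub addresses out of every object config file before publishing
    cd /usr/local/nagios/etc
    for f in *.cfg; do
        sed 's/\(address[\t ]*\).*/\1xxx.xxx.xxx.xxx/' "$f" > "$f.stripped"
    done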

> What version of plugins are you using?

Sorry, I neglected to specify.  I'm using 1.4.

> I hope you're aware that linux 2.4.12, 2.4.24 and 2.4.27 all had local 
> root exploits, although 24 and 27 only with rather special configurations.

Long story, but the short version is "our group doesn't support the OS on 
that machine."  :)  We support the machines it is monitoring, but we are 
essentially "borrowing" time on this server to run Nagios and watch our 
own systems.  I've already expressed concerns about the somewhat outdated 
kernel and distribution on that box, but there's nothing else I can do 
about that.  I do have root access to the box, but I only use it for 
whatever is absolutely necessary for our system-monitoring tasks.  
Aside from that, this machine is a "black box" to me.

> I didn't even know the issue existed in 1.2, so I don't think a Nagios 
> upgrade will be all that helpful, really. Of course, if you have a spare 
> server available you could try running both in parallel for a while and 
> see if 2.0 works better for you.

I do not.  I might be able to get a downtime where I could take down my 
existing 1.2 management server and run a 2.0 build for a short period, 
just to test it.  That would take some time to schedule, since this is 
currently a production system.

> gdb might be a good option. You should recompile your nagios with 
> extended debugging symbols in case you decide to use it (-ggdb3) and 
> keep the source-files untouched after running ./configure so you can 
> follow execution more closely. Enabling the proper DEBUG preprocessor 
> directive might also help (I believe there's a special one just for 
> debugging checks).

I might have time to try that, but I don't expect it would be until 
some time next week.  I'll see how things shape up schedule-wise
here.
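If I get that far, my plan is roughly the following.  The exact DEBUG 
macro for check execution I'd have to look up in the source, and the 
binary/lock-file paths are just the source-install defaults, so treat 
this as a sketch:

    # rebuild with full debug symbols; keep the configured source tree around
    cd nagios-1.2
    ./configure --prefix=/usr/local/nagios
    make all CFLAGS="-ggdb3 -O0"
    # later, attach to the running daemon to watch check scheduling
    gdb /usr/local/nagios/bin/nagios `cat /usr/local/nagios/var/nagios.lock`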

> Other than that I'd say a couple of scripts to examine the logfiles for 
> missing checks is your best bet.

I'm not sure what you mean specifically there.  Do you mean just generating 
a list of which checks are being missed and verifying that Nagios is 
picking them up as orphaned?
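If that's what you meant, something like this quick pass over the log 
should do as a starting point (log path assumed, and I'd have to confirm 
the exact wording of the orphan warning message):

    # count orphan warnings per service so the worst offenders stand out
    grep -i orphan /usr/local/nagios/var/nagios.log \
        | sed 's/^\[[0-9]*\] //' \
        | sort | uniq -c | sort -rn | head -20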

Thanks - Chuck



