Growing number of orphaned service checks...

Andreas Ericsson ae at op5.se
Thu Mar 3 00:49:34 CET 2005


Charles Dee Rice wrote:
> Hello!  I've been lurking on the lists for a while now.  I have a
> smallish-environment of 41 hosts, 397 active checks and 5 passive checks. 
> Most of my active checks are using nrpe-2.0, and my passive checks are
> submitted via nsca-2.4.  I noticed recently that some active checks did not
> appear to be completing in a timely fashion (most are set to an interval of 7
> minutes, but seemed to be taking an hour or more to complete), so I turned on
> check_for_orphaned_services.
> 

What's your plugin timeout value (service_check_timeout in nagios.cfg)?
It should take care of killing runaway plugins. This might fail if the
plugin is running as a different user than the nagios process, though.
Are there any setuid/setgid (+s) bits anywhere in the path to your
plugins, or on the plugins themselves?
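
A quick way to look for those bits is sketched below. It's only a rough
helper, not anything shipped with Nagios: the function name
list_special_bits is made up, and /usr/local/nagios/libexec in the
example call is an assumed install path that may differ on your box.

```shell
#!/bin/sh
# Sketch: report setuid/setgid bits on a plugin directory, its parents,
# and the plugin files inside it. The directory is passed as $1;
# /usr/local/nagios/libexec in the example call is an assumed path.
list_special_bits() {
    dir=$1

    # Walk from the plugin directory up towards / and print any
    # directory that has the setuid (-u) or setgid (-g) bit set.
    p=$dir
    while [ -n "$p" ] && [ "$p" != "/" ] && [ "$p" != "." ]; do
        if [ -u "$p" ] || [ -g "$p" ]; then
            echo "$p"
        fi
        p=$(dirname "$p")
    done

    # Print plugin files that carry setuid (4000) or setgid (2000).
    find "$dir" -type f \( -perm -4000 -o -perm -2000 \) -print
}

# Example call (assumed path):
#   list_special_bits /usr/local/nagios/libexec
```

No output means nothing in that path has +s set.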

> Then I saw a rather alarming number of services which nagios was detecting as
> orphaned, and rescheduled for immediate checks.  The longer nagios is left
> running, the longer and longer this list becomes (although it does not
> contain a predictable list of services; in other words, it's not the "same"
> services being orphaned all the time), and the more and more nagios processes
> are left running ("Process Info" reports upwards of 600+ nagios processes
> running).
> 

Definitely not good. This might be due to several master instances
running simultaneously (an excess master might then reap the check
results of the actual master process through the wait()/waitpid()
syscalls, causing the real master never to see the result of the
checks). What happens if you killall -9 nagios, clean up the garbage
and then restart it properly from the init script?
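
To see whether more than one master is around, something like the
following might help. It assumes the daemon's process name is "nagios"
and that a daemonized master is parented by init (PID 1); the function
name is made up for illustration. Note that check children whose parent
has died are also re-parented to init, so they will show up here too.

```shell
#!/bin/sh
# Sketch: read "PID PPID COMMAND" lines on stdin and print the PIDs of
# processes named "nagios" whose parent is init (PID 1).
# Assumption: the daemon's command name is exactly "nagios".
init_parented_nagios() {
    awk '$3 == "nagios" && $2 == 1 { print $1 }'
}

# Example call:
#   ps -eo pid,ppid,comm | init_parented_nagios
```

More than one PID in the output is worth a closer look before you
restart.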

> I can restart nagios to "catch up" for some time, but left running, the
> orphaned list begins to grow again.  The monitored nodes are not
> heavily-taxed either.
> 
> My management server is used for other web services (other in-house business
> web pages, user interfaces, etc), but is not in my opinion unusually busy or
> overtaxed.
> 

This isn't a load issue, so don't worry about it.

> I've experimented changing my max_concurrent_checks value from the default of
> 0 to values both above and below what is recommended by running "nagios -s",
> with no noticeable improvement.  I've tried extending my
> normal_check_interval, and that seemed to delay the initial onset of the
> problem (it took longer to start seeing orphaned checks, but they continued
> to grow just the same).
> 
> I'll be happy to post any specific configuration or log file entries as
> anyone sees appropriate, but didn't want to clutter the list with more info
> than needed.
> 

If you feel like it, you could put all of the config up for browsing. 
Make heavy use of sed to obscure sensitive data, like so:
sed 's/\(address[\t ]*\).*/\1xxx.xxx.xxx.xxx/' object.cfg > object.cfg.stripped

> I'm using nagios-1.2 on a Linux box running Red Hat Enterprise Linux ES
> release 2.1 (Panama), linux kernel 2.4.9-e.27.

What version of plugins are you using?

>  Updating the Linux release is
> not an option (corporate standard and configuration freeze),

I hope you're aware that linux 2.4.12, 2.4.24 and 2.4.27 all had local 
root exploits, although 24 and 27 only with rather special configurations.

> and running
> nagios-2-beta is highly undesirable due to its "beta" state and the usual
> corporate fear of beta-release software.  However, if this issue is addressed
> in nagios-2, I might be able to make a business case to upgrade.
> 

I didn't even know the issue existed in 1.2, so I don't think a Nagios
upgrade will be all that helpful, really. Of course, if you have a spare
server available you could try running both in parallel for a while and
see if 2.0 works better for you.

> If there are resources available online which I could use to help
> troubleshoot this issue, please point me to them.

gdb might be a good option. You should recompile your nagios with 
extended debugging symbols in case you decide to use it (-ggdb3) and 
keep the source-files untouched after running ./configure so you can 
follow execution more closely. Enabling the proper DEBUG preprocessor 
directive might also help (I believe there's a special one just for 
debugging checks).

Other than that I'd say a couple of scripts to examine the logfiles for 
missing checks is your best bet.
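
As a starting point, here's a rough sketch that counts orphan warnings
per host and service. It assumes the warning in nagios.log reads
"The check of service 'X' on host 'Y' looks like it was orphaned ...";
check that against your own log first. The function name and the log
path in the example call are made up for illustration.

```shell
#!/bin/sh
# Sketch: read a Nagios log on stdin and print "count host;service"
# for every orphan warning, most frequent first.
# Assumption: the warning text matches the pattern in the sed
# expression below; adjust it if your log lines differ.
count_orphans() {
    sed -n "s/.*check of service '\(.*\)' on host '\(.*\)' looks like it was orphaned.*/\2;\1/p" |
        sort | uniq -c | sort -rn
}

# Example call (assumed log path):
#   count_orphans < /usr/local/nagios/var/nagios.log
```

If the same few services dominate the list, that points at specific
plugins; an even spread points more towards the daemon itself.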

>  I've quite thoroughly
> reviewed the 1.2 and 2-beta docs, FAQs and mail archives, and haven't found a
> solution.  If there is more information I can post regarding my
> configuration, please ask away...
> 
> Thanks - Chuck
> 
> 
> 
> -------------------------------------------------------
> SF email is sponsored by - The IT Product Guide
> Read honest & candid reviews on hundreds of IT Products from real users.
> Discover which products truly live up to the hype. Start reading now.
> http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null
> 

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer

