Last ditch effort

Marc Powell marc at ena.com
Wed Mar 24 18:29:57 CET 2004


On Wednesday, March 24, 2004 10:22 AM, Aaron Levitt shared with us:


> I have recently upgraded to nagios 1.1 from an older release of
> netsaint after a long time of faithful service.  Since the upgrade,
> we have been getting some random timeouts, with an average of 1 or 2
> a day.  All the information I really have to go on, is the output
> from nagios.  The mail contains "Info: CRITICAL - Plugin timed out
> after 10 seconds".  The logs have the same information, but nothing
> more helpful.  I'm not sure where it's getting the 10 seconds from. 
> Initially I thought it was nrpe timing out, but it seems to be random
> services and hosts as well.        

Besides the global timeout values, each plugin usually supports its own
timeout, typically specified with the -t command line option. If no -t
option is specified, most plugins default to 10 seconds. Run your
favorite plugin with --help for the options and timeouts specific to
that plugin.

One or two timeouts a day really is minimal and could easily be
explained by transient network problems. That's why you specify retries
to verify the state. Without any information about the topology of your
network and the location of the hosts relative to Nagios, it's hard to
offer more insight. I can say that if you're checking hosts on the other
end of an ISDN line, you're more likely to see timeouts than with a host
on the local LAN.

One thing to check, since it's often a problem: make sure speed and
duplex are hardcoded on all machines and on the switches they connect
to. Auto-negotiation is unreliable in many cases. If you're checking
hosts over a WAN, are there capacity issues at certain times of the day?
Are there any interface errors anywhere in the path between your Nagios
host and the devices you are monitoring? Is the host you're checking
simply slow to respond due to load or other issues (e.g. reverse DNS
lookups)?
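
To illustrate how a per-check timeout and retries interact, here is a
minimal Python sketch. It is not how Nagios itself schedules retries
(Nagios re-checks at its retry interval and tracks soft and hard
states); the function name run_check and its parameters are hypothetical,
and the command passed in stands in for any plugin.

```python
import subprocess
import sys

CRITICAL = 2  # conventional plugin exit code for a CRITICAL result


def run_check(cmd, timeout_s=10.0, max_attempts=3):
    """Run cmd up to max_attempts times (hypothetical helper).

    Returns (exit_code, attempts_used). A run that exceeds timeout_s is
    killed and counted as CRITICAL, similar in spirit to the
    "Plugin timed out after 10 seconds" message.
    """
    code = CRITICAL
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.PIPE,
                timeout=timeout_s,
            )
            code = result.returncode
        except subprocess.TimeoutExpired:
            # The child process has been killed by subprocess.run.
            code = CRITICAL
        if code == 0:
            # A single success ends the retry loop early.
            return 0, attempt
    return code, max_attempts


if __name__ == "__main__":
    # Stand-in for a slow plugin: sleeps longer than the timeout.
    slow = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_check(slow, timeout_s=0.2, max_attempts=2))
```

A single transient timeout only becomes a reported problem if every
attempt fails, which is why one or two isolated timeouts a day usually
point at the network rather than at the check itself.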
 
> So far, I have changed max_concurrent_checks and various timeout
> values in nagios.cfg.  As well as changing max_check_attempts and
> normal_check_interval to make sure there wasn't too much going on at
> the same time (which really shouldn't matter since nagios is only
> monitoring about 60 hosts).  I poked through the source code but
> couldn't find anything with a 10 second timeout. 

Have you tried running /path/to/nagios -s /path/to/nagios.cfg to get a
recommended value for max_concurrent_checks? (The -s option prints the
projected check scheduling; -v only verifies the configuration.) The
recommended value varies with your normal_check_interval, the total
number of hosts and services, and other variables. Also, the number of
services you're checking matters far more than the number of hosts,
since Nagios never checks hosts unless a service on that host fails. If
you ran a full suite of service checks on each host, that could add up
to 500+ service checks. If you're only monitoring two or three
quick-to-finish service checks on 60 hosts at 3-5 minute intervals, then
the recommended max_concurrent_checks will probably be very low and not
much of a factor.
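
As a rough back-of-the-envelope check on that last point, the sketch
below estimates how many checks are in flight on average. This is not
the formula Nagios uses for its recommendation; it is just
rate-times-duration arithmetic, and the function name and the 2-second
average check duration are assumptions chosen for illustration.

```python
def estimated_concurrency(hosts, services_per_host, interval_s, avg_check_s):
    """Estimate average concurrent checks (hypothetical helper).

    Returns (total_checks, checks_per_second, average_in_flight).
    """
    total_checks = hosts * services_per_host
    checks_per_second = total_checks / interval_s
    # Average number of checks running at once = arrival rate * duration.
    average_in_flight = checks_per_second * avg_check_s
    return total_checks, checks_per_second, average_in_flight


if __name__ == "__main__":
    # 60 hosts, 3 services each, 5-minute interval, assumed 2 s per check.
    total, rate, in_flight = estimated_concurrency(60, 3, 300, 2.0)
    print(total, round(rate, 2), round(in_flight, 2))
```

With those numbers there are 180 checks per cycle, well under one check
per second, so only one or two checks would overlap on average.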
    
> 
> Currently nagios is running on its own box, no other services are
> running on it.  It's a 2.4.20 kernel on Redhat 9 and the hardware is
> a PIII 800MHz and it's got 384MB of RAM.  Nothing very special, but
> that should be enough I would think.

Yes, that should be fine. I'm running 1000+ service checks at 5-minute
intervals on similar hardware. Are you using state retention? How often
are you saving state? If it's very often, Nagios could end up spending
more time writing the status.sav file than processing checks (not
likely in your case, though, unless it's every 10 seconds or so).

--
Marc









