(Service Check Timed Out) returns critical

Michael Markstaller mm at elabnet.de
Tue Nov 19 20:51:19 CET 2002


Hi again,

Today I had the same experience again: Nagios and the machine were
almost dead after a router with 13 hosts behind it (yes, parents are
set) had been offline for a while. At first it alerted the outage
correctly, but then more and more processes, up to 250 of
"/usr/local/nagios/bin/nagios -d /etc/nagios/nagios.cf", appeared and
slowly disappeared.
The checks are mostly check_snmp (approx. 4 per host), and I've seen
very high CPU usage from each parallel snmpget, consuming nearly 100% of
BOTH PIII-866 CPUs. Is this "normal"?
It's Red Hat 7.1 with kernel 2.4.18-18.7.xsmp and
ucd-snmp-4.2.5-7.71.0.rpm; maybe my problem is the high CPU usage of
the check_snmp commands? My Nagios is 1.0b6 with plugins 1.3-beta1.

And then again, is it possible to set service_check_timeout to
e.g. 10 seconds and have Nagios kill the unresponsive check commands,
returning an UNKNOWN state instead of CRITICAL, to avoid such hangs
and false alerts and to let Nagios continue? No service check, and
certainly not a check_snmp, should ever take longer than a few seconds;
otherwise something is wrong. I also tried playing around with the -t
parameter, but this doesn't seem to help. When I run e.g. snmpwalk
against unresponsive hosts on the command line, it immediately returns
an error.
Is there a way to see which checks actually hang and what the Nagios
processes I see there are "doing" (I'm no Unix guru ;), so I can see
more details about what happens the next time it hangs?
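
One possible way to look at this, as a rough sketch only: take a process
listing and pick out the commands whose parent is one of the nagios
processes, together with how long they have been running. The sketch
assumes the listing comes from "ps -eo pid,ppid,etime,args" and that the
daemon's command line contains "nagios -d" as quoted above; the function
names are made up for illustration and are not part of Nagios.

```python
import subprocess


def parse_ps(text):
    """Parse the output of `ps -eo pid,ppid,etime,args` into dicts."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # ignore lines that do not have all four columns
        pid, ppid, etime, args = parts
        rows.append({"pid": int(pid), "ppid": int(ppid),
                     "etime": etime, "args": args})
    return rows


def children_of_nagios(rows, marker="nagios -d"):
    """Return processes started by a nagios process that are not nagios.

    `marker` is an assumption: a substring that identifies the daemon
    (and its forked copies) in the command line.
    """
    nagios_pids = {r["pid"] for r in rows if marker in r["args"]}
    return [r for r in rows
            if r["ppid"] in nagios_pids and marker not in r["args"]]


def snapshot():
    """Take a live listing (needs a `ps` that supports -eo)."""
    out = subprocess.run(["ps", "-eo", "pid,ppid,etime,args"],
                         capture_output=True, text=True).stdout
    return children_of_nagios(parse_ps(out))
```

Printing the "etime" and "args" of each entry returned by snapshot()
while the machine is under load would show which check commands have
been running unusually long.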

Thanks for your help


Michael

-----Original Message-----
From: Michael Markstaller 
Sent: Monday, November 18, 2002 4:28 PM
To: nagios-users at lists.sourceforge.net
Subject: RE: [Nagios-users] (Service Check Timed Out) returns critical


Hi,

I have set parents for each host. My problem is that when a very
"close" host fails, Nagios seems to mess up (>100 nagios processes
running) with a high number of service checks which never finish, and
then reports critical alerts for *some* of them because of (Service
Check Timed Out). As I said, we're talking about 100 hosts and 350
services, with more each day. The system is RH 7.1 on a Dell PE Dual
PIII-866 with 256MB and HW RAID-1.

First, I want to tell Nagios to kill unresponsive service checks (from
Nagios's point of view) after a shorter time and recognise them as
UNKNOWN rather than CRITICAL, to "help" Nagios execute further checks
and find the unresponsive parent host by pinging.
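
The behaviour described above could be approximated outside Nagios with
a small wrapper around the plugin. This is only a sketch under
assumptions: the names run_check and main are hypothetical, the
10-second default is taken from the figure mentioned in this thread, and
the only Nagios facts relied on are the standard plugin exit codes
(0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(argv, timeout_s=10):
    """Run a plugin command line and return (exit_code, first_output_line).

    A timeout or a failure to start the plugin is reported as UNKNOWN
    rather than CRITICAL.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child process before raising.
        return UNKNOWN, "UNKNOWN - check timed out after %ss" % timeout_s
    except OSError as exc:
        return UNKNOWN, "UNKNOWN - could not run check: %s" % exc
    lines = proc.stdout.strip().splitlines()
    output = lines[0] if lines else ""
    # Anything outside the four known codes is treated as UNKNOWN too.
    valid = (OK, WARNING, CRITICAL, UNKNOWN)
    code = proc.returncode if proc.returncode in valid else UNKNOWN
    return code, output


def main(argv):
    """Print the plugin output and return the exit code for the caller."""
    code, output = run_check(argv)
    print(output)
    return code
```

Saved as a script that ends with sys.exit(main(sys.argv[1:])), this
could be placed in front of the real plugin in the command definition.
The wrapper's timeout would need to be shorter than
service_check_timeout, otherwise Nagios kills the check first and the
result is CRITICAL again.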

It seems that Nagios is sometimes no longer able to check and determine
all states because of the high load from the many pending checks, and is
therefore unable to "find" the first host in the dependencies that
really failed. But that's not my real problem. I tried to work around
this with a lower service_check_timeout to kill unresponsive
service checks before they overload the machine and Nagios; the result
is the above: I get *some* alerts for critical services instead of the
failed host, and the machine is nearly unresponsive (loadavg above 5).
The easiest way to resolve this is to restart the Nagios process, but
that would be rather impractical.

Michael

-----Original Message-----
From: Scott [mailto:lists.scott at themagicbox.net] 
Sent: Monday, November 18, 2002 2:38 PM
To: nagios-users at lists.sourceforge.net
Subject: Re: [Nagios-users] (Service Check Timed Out) returns critical


On your hosts, always set a parent. This way, when a host becomes
unreachable, Nagios will walk the parent tree and see where the network
has actually failed. This is basically a dependency of hosts and makes
for a lot fewer pages/emails when something closer to Nagios fails.

Example:

define host{
        host_name               some.host
        alias                   some.host.alias
        address                 some.hosts.ip.address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   40
        notification_period     24x7
        notification_options    d,r
        parents                 some.switch.on.my.network
        }

This means that when check-host-alive fails for some.hosts.ip.address,
it checks some.switch.on.my.network to make sure it is actually the host
that has failed and not the switch. If the switch has failed, it only
pages for that and sets a blocking outage on the web page.. pretty
nifty :)
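
The idea of walking the parent tree can be illustrated with a toy model.
This is not how Nagios implements it, just a sketch of the logic,
assuming one parent per host (Nagios allows several) and a hypothetical
is_up function standing in for check-host-alive.

```python
def find_root_failure(host, parents, is_up):
    """Walk up the parent chain from a failed host.

    parents: dict mapping host name -> parent host name (None at the top).
    is_up:   callable returning True if a host answers its check.
    Returns the failed host closest to the monitoring server, i.e. the
    only one worth paging for.
    """
    culprit = host
    seen = {host}  # guards against a misconfigured parent loop
    parent = parents.get(host)
    while parent is not None and parent not in seen and not is_up(parent):
        culprit = parent
        seen.add(parent)
        parent = parents.get(parent)
    return culprit
```

With some.host behind some.switch.on.my.network, a failed switch makes
the function return the switch, and some.host would only be considered
unreachable rather than down.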

Scott


Michael Markstaller said:

> Hi,
>
> I'm using nagios to check approx. 100 hosts and 350 services, working
> fine so far.
> I'm wondering whether it's possible to tell Nagios to report "unknown"
> instead of critical if a service check times out? I tried setting
> "service_check_timeout" in nagios.cfg to 30 to have Nagios kill
> non-responsive service checks more quickly in case of high load due to
> many unreachable hosts (see below), but this resulted in dozens of
> critical alerts due to (Service Check Timed Out) with check_snmp.
> I'd prefer to get "unknown" for any plugin timeout error that does not
> result in a retrieved value. Or maybe this problem is located
> within check_snmp?
>
> The hosts are mostly routers and quite distributed, so I have set up
> dependencies for all hosts to get a notification only for the host
> that failed, but this doesn't work as well as I think it should. If,
> for instance, the first router on which all others depend fails,
> Nagios gets quite messed up, with a few hundred processes for pending
> checks, and gives me many false alerts instead of the one causing the
> problem. Does anybody have a general guideline for getting useful
> alerts when something "core" fails (like the switch the Nagios server
> is attached to, or DNS etc.)?
>
> Thanks,
>
> Michael Markstaller
>
> Elaborated Networks GmbH
> www.elabnet.de
> Lise-Meitner-Str. 1, D-85662 Hohenbrunn, Germany
> fon: +49-8102-8951-60, fax: +49-8102-8951-80
>
>
> -------------------------------------------------------
> This sf.net email is sponsored by: To learn the basics of securing
> your web site with SSL, click here to get a FREE TRIAL of a Thawte
> Server Certificate: http://www.gothawte.com/rd524.html
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
>






