Service checks and retry check interval

Tom Valdes Tom.Valdes at flamenconetworks.com
Thu Jun 17 15:35:07 CEST 2004


OK, I understand that when the service check fails, Nagios moves on to the host check.
As you can see, my max_check_attempts is set to 5 for the host check.  Shouldn't this delay the notification until the host has been checked 5 times?  And once the host is down, is there a way to speed up the checks that detect recovery?

The problem I'm having is that if Nagios misses a ping due to network congestion or whatever, it takes 5 minutes to realize that nothing is really wrong, when all that happened was a missed ping that might have been caught if Nagios had simply done another check before sending out a notification.
 
-----Original Message-----
From: Marc Powell [mailto:marc at ena.com] 
Sent: Wednesday, June 16, 2004 7:38 PM
To: Tom Valdes; nagios-users at lists.sourceforge.net
Subject: RE: [Nagios-users] Service checks and retry check interval


That is correct and by design. Nagios must determine the status of a questionable host before it does anything else. If it didn't, the dependency and network reachability logic could be flawed, and Nagios would send out spurious alerts for services on a host that is down when those alerts really should be suppressed (see http://nagios.sourceforge.net/docs/1_0/networkreachability.html and the Host Checks section of http://nagios.sourceforge.net/docs/1_0/checkscheduling.html).
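
To illustrate the reachability logic mentioned above, here is a minimal Python sketch of how a host whose own check has failed might be classified as DOWN or UNREACHABLE from the state of its parents. It is a simplified model, not Nagios source code: the function name classify_failed_host and the two dictionaries are hypothetical, and the parent states are simply given rather than checked recursively as Nagios would do.

```python
from typing import Dict, List


def classify_failed_host(host: str,
                         parents: Dict[str, List[str]],
                         is_up: Dict[str, bool]) -> str:
    """Classify a host whose own check failed as DOWN or UNREACHABLE.

    parents maps a host name to its parent host names (hypothetical layout).
    is_up maps a parent host name to its already-known state.
    """
    host_parents = parents.get(host, [])
    if not host_parents:
        # No parents defined: nothing sits between the monitor and the host.
        return "DOWN"
    if any(is_up[parent] for parent in host_parents):
        # At least one path to the host works, so the host itself is at fault.
        return "DOWN"
    # Every parent is down: the host cannot be reached, so its state is unknown.
    return "UNREACHABLE"


# Names taken from the hosts.cfg in this thread: Test-Server behind switch1.
topology = {"Test-Server": ["switch1"]}
print(classify_failed_host("Test-Server", topology, {"switch1": True}))
print(classify_failed_host("Test-Server", topology, {"switch1": False}))
```

With switch1 up, the sketch reports Test-Server as DOWN. With switch1 down, it reports UNREACHABLE, which is the case where alerts for Test-Server and its services would normally be suppressed.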

--
Marc

p.s. Please post to the list in plain text format. It makes it much, much easier to reply with proper quoting and you're going to reach a much larger audience who can help you.
________________________________________
From: Tom Valdes [mailto:Tom.Valdes at flamenconetworks.com] 
Sent: Wednesday, June 16, 2004 6:25 PM
To: Marc Powell; nagios-users at lists.sourceforge.net
Subject: RE: [Nagios-users] Service checks and retry check interval

I had changed the 10 retries to 5 after I grabbed the copy of the status.  I did reload Nagios so that's just an old capture.
I think I understand what you mean about performing the host check and bypassing a service check, but it seems a retry_check_interval value is not allowed in hosts.cfg.
 
---------------services.cfg------------------
--------------------------------------------------
define service{
        use                             generic-service         ; Name of service template to use
        host_name                       Test-Server
        service_description             PING
        is_volatile                     0
        check_period                    workhours
        max_check_attempts              5
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  test-contact
        notification_interval           960
        notification_period             workhours
        notification_options            c,r
        check_command                   check_fping!50%!100%
        }
--------------------------------------------------
------------------hosts.cfg-------------------
define host{
        use                     generic-host            ; Name of host template to use
        parents                 switch1
        host_name               Test-Server
        alias                   TestServer
        address                 10.0.0.21
        check_command           check-host-alive
        max_check_attempts      5
        notification_interval   30
        notification_period     24x7
        notification_options    d,u,r
        }
----------------------------------------------------


________________________________________
From: nagios-users-admin at lists.sourceforge.net on behalf of Marc Powell
Sent: Wed 6/16/2004 5:42 PM
To: Tom Valdes; nagios-users at lists.sourceforge.net
Subject: RE: [Nagios-users] Service checks and retry check interval


________________________________

>From: Tom Valdes [mailto:Tom.Valdes at flamenconetworks.com]
>Sent: Wednesday, June 16, 2004 2:55 PM
>To: nagios-users at lists.sourceforge.net
>Subject: [Nagios-users] Service checks and retry check interval

> I currently have my normal_check_interval set to 5 minutes

> If a service check is missed, I'd like it to retry 5
> times before sending a notification and I'd like the
> retry interval to be 1 minute.  (can it be less? 
> Like 10 seconds?)

>I've tried adding the following to services.cfg

>        max_check_attempts         5
>        normal_check_interval        5
>        retry_check_interval           1

I presume this is for the service definition. Can we see the complete
definition?

> Shouldn't this retry a failed check every minute
> for 5 tries before sending a notification?

For the service above under normal circumstances, yes. I use 5,5,3 to
delay notifications by ~15 minutes.
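
As a rough worked example of that arithmetic, the Python sketch below estimates how long after the first failed check the final attempt happens. It is a simplified model under stated assumptions: attempt 1 is the first failure, the remaining attempts are spaced exactly retry_check_interval apart, and one interval unit is 60 seconds (the interval_length parameter here stands in for the setting of the same name in nagios.cfg). The function name is hypothetical, and scheduling jitter and check run time are ignored.

```python
def minutes_until_hard_state(max_check_attempts: int,
                             retry_check_interval: int,
                             interval_length: int = 60) -> float:
    """Minutes from the first failed check to the last (notifying) attempt.

    Assumes attempt 1 is the first failure and the remaining
    max_check_attempts - 1 attempts are each spaced retry_check_interval
    units apart, where one unit is interval_length seconds.
    """
    retries = max_check_attempts - 1
    return retries * retry_check_interval * interval_length / 60


# The 5,5,3 settings above: 4 retries x 3 minutes after the first failure.
marc_delay = minutes_until_hard_state(max_check_attempts=5, retry_check_interval=3)
# The 5,5,1 settings from the original question: 4 retries x 1 minute.
tom_delay = minutes_until_hard_state(max_check_attempts=5, retry_check_interval=1)
print(marc_delay, tom_delay)  # 12.0 4.0
```

Under this model 5,5,3 gives 12 minutes after the first failure. The failure itself may go unnoticed for up to one normal_check_interval (5 minutes), which is presumably how the figure of ~15 minutes comes about. Passing a smaller interval_length shows what sub-minute retries would look like, but in Nagios that setting scales every interval, not just the retry interval.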

> Using a test server, I pull the plug and Nagios
> catches the 100% ping loss but if I plug it back
> in as soon as it notices, Nagios emails me right
> away and doesn't return an Up state for another
> 5 minutes?

For the service or the host? See below.

> The following is what I receive on the status
> screen.. It shows a State Type: HARD.. Shouldn't
> it be in a SOFT state until it completes the
> max_check_attempts?

> Current Status:      CRITICAL
> Status Information:  FPING CRITICAL - 192.168.100.21 (loss=100.000000% )
> Current Attempt:     1/10

Why is max attempts showing 10 here if it's defined as 5 above? Did you
restart nagios after making the change? Do you have multiple nagios
processes running?

There is a special situation that results when you just 'pull the plug'
on a machine you're monitoring. The service check will of course fail on
the first attempt. Nagios will then attempt to check the status of the
host using the host check_command. It will do this exclusively until
max_check_attempts defined for the host is reached and will not attempt
to recheck the status of the service if the host is determined to be
down or unreachable. At that point nagios will attempt to send a HOST
down notification which may be what you are seeing. Because of this
special situation, your retry_check_interval for the service has no
meaning. AFAIK, nagios just falls back to normal_check_interval until
one or more services on the host recover (and the host by inference).
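
The sequence described above can be sketched in Python as follows. This is only a model of the behaviour as described in this message, not Nagios code: the function on_service_failure and the host_check callable are hypothetical, with host_check standing in for the host check_command (check-host-alive in the posted config) and returning True when the host answers.

```python
from typing import Callable, List


def on_service_failure(host_check: Callable[[], bool],
                       host_max_check_attempts: int) -> List[str]:
    """Return the actions taken after a first failed service check."""
    actions = ["service check failed (attempt 1)"]
    # The host is checked exclusively until it answers or attempts run out.
    for attempt in range(1, host_max_check_attempts + 1):
        host_up = host_check()
        state = "UP" if host_up else "DOWN"
        actions.append(f"host check {attempt}/{host_max_check_attempts}: {state}")
        if host_up:
            # Host is fine, so the service retry logic applies as usual.
            actions.append("host is up: retry service at retry_check_interval")
            return actions
    # Host never answered: notify about the host, skip the service retries.
    actions.append("host is down: send HOST notification")
    actions.append("service retries skipped: next check at normal_check_interval")
    return actions


# "Pull the plug" case: every host check fails.
for line in on_service_failure(lambda: False, 5):
    print(line)
```

In the pulled-plug case the five host checks run back to back and the HOST notification goes out straight away, which would match the immediate email. Recovery is then only noticed at the next normal_check_interval.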

--
Marc


-------------------------------------------------------
This SF.Net email is sponsored by The 2004 JavaOne(SM) Conference
Learn from the experts at JavaOne(SM), Sun's Worldwide Java Developer
Conference, June 28 - July 1 at the Moscone Center in San Francisco, CA
REGISTER AND SAVE! http://java.sun.com/javaone/sf Priority Code NWMGYKND
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue.
::: Messages without supporting info will risk being sent to /dev/null







