Passive Service Checks and Freshness Checking

Demetri Mouratis dmourati at cm.math.uiuc.edu
Fri Nov 5 00:10:49 CET 2004


I've recently run into a problem with passive service checks and freshness
checking.  On my Distributed Nagios, I have a NAGIOS service which checks
the health of the local Nagios process.  I also have a HEARTBEAT check which
sends the results of the NAGIOS check from Distributed to Central every
minute.  Here are the service definitions on Distributed:

# NAGIOS
define service{
        use                             dev-service
        service_description             NAGIOS
        hostgroup_name                  dev-local
        check_command                   check-nagios
        max_check_attempts              1
        normal_check_interval           1
        }
# HEARTBEAT
define service{
        use                             dev-service
        service_description             HEARTBEAT
        hostgroup_name                  dev-local
        check_command                   check-heartbeat
        max_check_attempts              1
        normal_check_interval           1
        check_freshness                 1
        freshness_threshold             180
        }

and Central:

# NAGIOS
define service{
        use                             dev-service
        service_description             NAGIOS
        hostgroup_name                  dev-local
        max_check_attempts              1
        normal_check_interval           1
        }
# HEARTBEAT
define service{
        use                             dev-service
        service_description             HEARTBEAT
        hostgroup_name                  dev-local
        max_check_attempts              1
        normal_check_interval           1
        check_freshness                 1
        freshness_threshold             180
        }

Also, on Central, the dev-service template has:

        check_command                   service-is-stale
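
The definition of service-is-stale is not shown in this message.  A
handler of this kind is typically a small plugin that prints a message
and returns a non-OK status.  The following is only a minimal sketch: the
function name, the message text, and the choice of CRITICAL (status 2)
are assumptions, not the actual definition in use here.

```shell
#!/bin/sh
# Hypothetical body for a "service-is-stale" plugin (not from the original setup).
# Nagios plugins report state through the exit status:
#   0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
# The first line of output becomes the status text.
service_is_stale() {
    echo "CRITICAL: no passive result received within the freshness threshold"
    return 2
}
```

A wrapper script would end with `service_is_stale; exit $?` so that
Central records a CRITICAL state whenever the forced check actually runs.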

This creates a nice HEARTBEAT setup where Central gets information about
the health of Distributed every minute:

[1099607624] EXTERNAL COMMAND:
PROCESS_SERVICE_CHECK_RESULT;sj-dev-db02;HEARTBEAT;0;Distributed Nagios
ok: located 5 processes, status log updated 8 seconds ago
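
For anyone reproducing this, the log entry above follows the standard
external command format for passive results.  The sketch below builds
such a line by hand; the function name and the command file path in the
comment are assumptions, and in a distributed setup the result would
normally arrive via whatever transport is in use rather than a local
write.

```shell
#!/bin/sh
# Build one external-command line in the format shown in the log above:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
format_passive_result() {
    # $1=epoch timestamp  $2=host  $3=service  $4=return code  $5=plugin output
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Example (CMD_FILE is an assumed path; use the command_file value from nagios.cfg):
#   CMD_FILE=/usr/local/nagios/var/rw/nagios.cmd
#   format_passive_result "$(date +%s)" sj-dev-db02 HEARTBEAT 0 "Distributed Nagios ok" > "$CMD_FILE"
```

Submitting a result this way on Central is a simple way to reset the
freshness timer by hand while testing.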

I'm testing the condition where Distributed goes down and the HEARTBEAT
messages stop coming.  After 180 seconds, the freshness threshold
expires and Central detects it like so:

[1099608948] Warning: The results of service 'HEARTBEAT' on host
'sj-dev-db02' are stale by 32 seconds (threshold=180 seconds).  I'm
forcing an immediate check of the service.
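
The "stale by" figure in that warning can be checked by hand.  The sketch
below uses a simplified model, which is an assumption rather than a
statement of the exact Nagios logic: the deadline is the time of the last
received result plus the freshness threshold, and the staleness is how far
the current time is past that deadline.  The function name is
hypothetical.

```shell
#!/bin/sh
# stale_by NOW LAST_CHECK THRESHOLD
# Prints how many seconds past the freshness deadline a result is,
# or 0 if the result is still fresh.
# Simplified model (assumption): deadline = last_check + threshold.
stale_by() {
    deadline=$(( $2 + $3 ))
    if [ "$1" -gt "$deadline" ]; then
        echo $(( $1 - deadline ))
    else
        echo 0
    fi
}
```

For example, `stale_by 1212 1000 180` prints 32: with a last result at
t=1000 and a threshold of 180, the deadline is t=1180, so at t=1212 the
result is 32 seconds stale.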

The problem I'm seeing is that the service-is-stale command is kicked off
only some of the time.  When the command is not executed, the Central
box keeps stale data indicating HEARTBEAT is OK, and we are unaware of the
outage on Distributed.

Does anyone have any idea how the freshness threshold could expire
without the corresponding service check being kicked off?

So far, I've recompiled Nagios with the DEBUG0-4 options enabled but
have yet to track down the issue.

Thanks in advance.

---------------------------------------------------------------------
Demetri Mouratis
dmourati at linfactory.com


