freshness check on passive service fails

Antoine Reid areid at logient.com
Thu May 27 23:13:16 CEST 2004


--On Monday, May 24, 2004 12:48 PM +0200 Michael Huettig 
<Michael.Huettig at Medien-Systempartner.de> wrote:

> Hi all,
> I'm using Nagios 1.2 with nsca/send-nsca 2.4 to submit passive
> check results from some services. It worked fine for more than 6 months,
> but last week it started driving me crazy.
>
> Nagios doesn't accept any value for freshness_threshold; every 5
> minutes it runs the script, which notifies me.

For what it's worth, I'm having similar issues myself. My setup is a bit
different, so I'll post it below. I have two Nagios processes running on
two different hosts, in different subnets. The one doing the actual checks
is obsessing over services and sends the results through nsca to the main
Nagios host. The main host seems to decide my service results aren't fresh
enough and runs the check_command, which is a dummy script returning
WARNING (originally CRITICAL, but it generated too many notifications).
Then, a few seconds or minutes later, a new passive check comes in and
brings the service(s) back to OK; a couple of minutes later it switches
back to WARNING, and so on.

Both hosts are running FreeBSD, one is on 4.9 (the main host) while the one
performing the actual checks is running 5.2.1.  All on i386.

Complete configs can be made available upon request (sent out-of-band to
save list bandwidth) if I didn't provide enough details.


I'm sure I'm either not using the software the way it's supposed to be
used, or I have a configuration glitch, but I can't seem to find it. I find
it odd that the main Nagios process would run the service check only a
couple of *seconds* after it has received an "OK" passive check. This type
of service is set with "active_checks_enabled 0" and "check_freshness 1",
and I understood it would only run the service check IF the results aren't
fresh enough.
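
As an illustration of the behaviour expected above (not Nagios's actual
implementation), here is a minimal shell sketch. The function name is_stale
and its arguments are made up for this example; the timestamps are taken
from the log excerpt further down, and 600 is the freshness_threshold from
the service template.

```sh
#!/bin/sh
# Hypothetical helper, not part of Nagios: succeeds when a passive result
# is older than the freshness threshold.
#  $1 = time of the last check result (epoch seconds)
#  $2 = current time (epoch seconds)
#  $3 = freshness threshold in seconds
is_stale() {
        age=$(( $2 - $1 ))
        [ "$age" -gt "$3" ]
}

# A result received 10 seconds ago, with a threshold of 600 seconds
if is_stale 1085691595 1085691605 600; then
        echo "stale: the check_command would be run"
else
        echo "fresh: no check should be run"
fi
```

Under this reading, a result that is only 10 seconds old should never
trigger the check_command.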

Can anyone shed some light on this?


Here are excerpts from my configs:

On the MAIN nagios machine (the one that receives the passive checks
and does notifications):

nagios.cfg: (not sure what is relevant here..)

ocsp_timeout=5
interval_length=60
execute_service_checks=1
accept_passive_service_checks=1
obsess_over_services=0
check_service_freshness=1
freshness_check_interval=600

and from service.cfg:

define service{
        name                            passive-service
        active_checks_enabled           0
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 1
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1

        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           5
        retry_check_interval            1
        notification_interval           120
        notification_period             24x7
        notification_options            w,u,c,r
        check_command                   service-is-stale
        freshness_threshold             600

        register                        0
        }

define service{
        use                             passive-service
        service_description             PING

        host_name                       bloodymary.domain.logient.com
        contact_groups                  unix-admins
        }

define service{
        use                             passive-service
        service_description             DNS

        host_name                       bloodymary.domain.logient.com
        contact_groups                  unix-admins
        }

(I have a bunch of services with "use passive-service" all configured this
way, and they all produce the same behaviour.)


Here is the "service-is-stale" command:

define command{
        command_name    service-is-stale
        command_line    $USER1$/staleservice.sh
        }

And the staleservice.sh script:

#!/bin/sh
/bin/echo "WARNING: Service results are stale!"
exit 1

--------------------------------------------------------------------------

On the *other* machine, also running Nagios, here are the config excerpts:

ocsp_timeout=5
interval_length=60
execute_service_checks=1
accept_passive_service_checks=1
enable_notifications=0
enable_event_handlers=1
obsess_over_services=1
ocsp_command=submit_check_result


define service{
        name                            generic-service
        active_checks_enabled           1
        passive_checks_enabled          1
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 0
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1

        register                        0
        }


and in services.cfg:

define service{
        use                             generic-service

        host_name                       bloodymary.domain.logient.com
        service_description             PING
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           1
        retry_check_interval            1
        contact_groups                  contactgroup
        notification_interval           120
        notification_period             24x7
        notification_options            w,u,c,r
        check_command                   check_fping!2000,80%!5000,100%
        }


define service{
        use                             generic-service

        host_name                       bloodymary.domain.logient.com
        service_description             DNS
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           1
        retry_check_interval            1
        contact_groups                  contactgroup
        notification_interval           120
        notification_period             24x7
        notification_options            w,u,c,r
        check_command                   check_dig!dev.domain.logient.com!10
        }



My submit_check_result looks like this:
define command{
        command_name    submit_check_result
        command_line    /usr/local/libexec/nagios/eventhandlers/submit_check_result $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ '$OUTPUT$'
        }

and the script itself contains:

-----
#!/bin/sh

# Arguments:
#  $1 = host_name (Short name of host that the service is
#       associated with)
#  $2 = svc_description (Description of the service)
#  $3 = state_string (A string representing the status of
#       the given service - "OK", "WARNING", "CRITICAL"
#       or "UNKNOWN")
#  $4 = plugin_output (A text string that should be used
#       as the plugin output for the service checks)
#

# Convert the state string to the corresponding return code
return_code=-1

case "$3" in
        OK)
                return_code=0
                ;;
        WARNING)
                return_code=1
                ;;
        CRITICAL)
                return_code=2
                ;;
        UNKNOWN)
                return_code=-1
                ;;
esac

# pipe the service check info into the send_nsca program, which
# in turn transmits the data to the nsca daemon on the central
# monitoring server

# Used for debugging only..
#/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" >> /tmp/send_nsca.log

/usr/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" | /usr/local/libexec/nagios/send_nsca 192.168.10.138 -c /usr/local/etc/nagios/send_nsca.cfg
-----


I'm using printf instead of echo; otherwise I had problems with some
plugin outputs that contained "%" signs.
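
A minimal sketch of why the "%s" form copes with such output: the plugin
text is passed as an argument to a fixed format string, so any "%" in it is
printed literally instead of being read as a format directive. The function
name format_result is hypothetical; the sample output is taken from the log
below.

```sh
#!/bin/sh
# Hypothetical helper: build one tab-separated input line for send_nsca.
# The values fill fixed "%s" placeholders, so "%" characters inside them
# are never interpreted as format directives.
format_result() {
        printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$3" "$4"
}

format_result bloodymary.domain.logient.com PING 0 \
        "FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.260000 ms)"
```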

------------------------------------------------------

Now, here is what I get on the main machine's log:

[1085691545] SERVICE ALERT: bloodymary.domain.logient.com;DNS;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691545] SERVICE ALERT: bloodymary.domain.logient.com;PING;OK;SOFT;2;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.260000 ms)
[1085691593] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;DNS;0;DNS ok - 0 seconds response time (dev.domain.logient.com.  1H IN A  192.168.0.201)
[1085691595] SERVICE ALERT: bloodymary.domain.logient.com;DNS;OK;SOFT;2;DNS ok - 0 seconds response time (dev.domain.logient.com.  1H IN A  192.168.0.201)
[1085691595] SERVICE ALERT: bloodymary.domain.logient.com;PING;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691602] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;PING;0;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.310000 ms)
[1085691605] SERVICE ALERT: bloodymary.domain.logient.com;DNS;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691605] SERVICE ALERT: bloodymary.domain.logient.com;PING;OK;SOFT;2;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.310000 ms)
[1085691652] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;DNS;0;DNS ok - 0 seconds response time (dev.domain.logient.com.  1H IN A  192.168.0.201)
[1085691655] SERVICE ALERT: bloodymary.domain.logient.com;DNS;OK;SOFT;2;DNS ok - 0 seconds response time (dev.domain.logient.com.  1H IN A  192.168.0.201)
[1085691656] SERVICE ALERT: bloodymary.domain.logient.com;PING;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691663] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;PING;0;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.330000 ms)
[1085691665] SERVICE ALERT: bloodymary.domain.logient.com;DNS;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691665] SERVICE ALERT: bloodymary.domain.logient.com;PING;OK;SOFT;2;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.330000 ms)
[1085691713] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;DNS;0;DNS ok - 0 seconds response time (dev.domain.logient.com.  1H IN A  192.168.0.201)
[1085691715] SERVICE ALERT: bloodymary.domain.logient.com;DNS;OK;SOFT;2;DNS ok - 0 seconds response time (dev.domain.logient.com.  1H IN A  192.168.0.201)
[1085691715] SERVICE ALERT: bloodymary.domain.logient.com;PING;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691723] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;PING;0;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.370000 ms)
[1085691725] SERVICE ALERT: bloodymary.domain.logient.com;DNS;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691725] SERVICE ALERT: bloodymary.domain.logient.com;PING;OK;SOFT;2;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.370000 ms)
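
To put numbers on the log, the following sketch prints, for each "stale"
alert, how many seconds had passed since the last passive result logged for
the same service. It assumes one log entry per line; the function name
stale_delays is made up, and the four sample entries are copied from the
log.

```sh
#!/bin/sh
# Sketch: report the delay between a passive result (EXTERNAL COMMAND) and
# the next "stale" SERVICE ALERT for the same service.
stale_delays() {
        awk -F';' '
        {
                # The epoch timestamp sits between the leading [ and ]
                ts = substr($1, 2, index($1, "]") - 2)
        }
        /EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT/ {
                last[$3] = ts          # $3 = service description
        }
        /SERVICE ALERT:/ && /Service results are stale/ {
                if ($2 in last)        # $2 = service description
                        print $2, ts - last[$2]
        }'
}

stale_delays <<'EOF'
[1085691593] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;DNS;0;DNS ok - 0 seconds response time (dev.domain.logient.com.  1H IN A  192.168.0.201)
[1085691602] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;bloodymary.domain.logient.com;PING;0;FPING OK - 192.168.0.200 (loss=0.000000%, rta=0.310000 ms)
[1085691605] SERVICE ALERT: bloodymary.domain.logient.com;DNS;WARNING;SOFT;1;WARNING: Service results are stale!
[1085691656] SERVICE ALERT: bloodymary.domain.logient.com;PING;WARNING;SOFT;1;WARNING: Service results are stale!
EOF
```

For these four entries it prints "DNS 12" and "PING 54", i.e. the stale
alerts arrive 12 and 54 seconds after a passive result, far below the
configured freshness_threshold of 600.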




> On host mh2 a cron script runs which submits the passive check result
> via send-nsca every hour. So every hour I receive that
> test-service-passive is o.k., but after 5 minutes Nagios wants to check
> the freshness of this service.

As you can see above, I'm using another Nagios process instead of cron, but
the result should be the same.

> Any suggestions or ideas why Nagios doesn't accept the
> check-freshness period of 4000 seconds?
>
> Regards,
>
> Michael



Thanks to anyone who read this far :)
antoine

--
Antoine Reid
Administrateur Système - System Administrator

          __________________________________________________

Logient Inc.
 Solutions de logiciels Internet - Internet Software Solutions
 417 St-Pierre, Suite #700
 Montréal (Qc) Canada H2Y 2M4
 T. 514-282-4118 ext.32
 F. 514-288-0033
 www.logient.com
