eventhandlers running when a dependent service dependency is not satisfied

Eli Stair estair at ilm.com
Fri Dec 9 06:14:36 CET 2005


I'm not entirely sure I am configuring this properly to achieve my goal,
so I'll state the question briefly and then give the details below.  The
question comes down to this:

   Should a failed service check for a dependent service trigger a check
of its parent before continuing?  If this is not the case, or not the
default, is there _ANY_ way to implement this?

I want to avoid at all costs having an every-minute check of the parent
processes on many thousands of hosts just to keep the child process
checks and event handlers from going haywire.

I want a dependency chain like this:

   SSH -- SNMP --\
                  - Ganglia
                  - NTP

I believe I have this set up so that a service check for SNMP is 
dependent on the SSH service running.  In turn, the service checks for 
other processes that use SNMP are dependent on SNMP running.  My intent
is that service checks for NTP, etc. will not be attempted if their parent
SNMP process is not in an OK state (as I have an event handler that will
restart SNMP if it is dead).  If the parent SNMP _IS_ running, then the 
child process checks (Ganglia, NTP, etc) will be checked and if dead 
their own event handler will activate.
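
To make the intended behaviour concrete, here is a minimal Python sketch of the gating described above. It only models what I want to happen, not how Nagios actually schedules checks. The names `PARENT`, `FAILURE_CRITERIA` and `may_check` are made up for illustration, and the sketch assumes the decision is based on each parent's last known state.

```python
from enum import Enum


class State(Enum):
    OK = "o"
    WARNING = "w"
    UNKNOWN = "u"
    CRITICAL = "c"
    PENDING = "p"


# Hypothetical model of the chain SSH -> SNMP -> {Ganglia, NTP}.
PARENT = {"SNMP": "SSH", "Ganglia": "SNMP", "NTP": "SNMP"}

# Mirrors execution_failure_criteria w,p,u,c: any non-OK parent blocks the child.
FAILURE_CRITERIA = {State.WARNING, State.PENDING, State.UNKNOWN, State.CRITICAL}


def may_check(service, last_state):
    """Return True if no ancestor's last known state matches the failure criteria.

    Walking all the way up the chain models the intent of inherits_parent 1.
    """
    parent = PARENT.get(service)
    while parent is not None:
        if last_state[parent] in FAILURE_CRITERIA:
            return False
        parent = PARENT.get(parent)
    return True


if __name__ == "__main__":
    states = {"SSH": State.OK, "SNMP": State.CRITICAL}
    for name in ("SNMP", "Ganglia", "NTP"):
        print(name, "check allowed:", may_check(name, states))
```

With SNMP marked CRITICAL, the SNMP check itself is still allowed (its parent SSH is OK), while the Ganglia and NTP checks, and therefore their event handlers, are held back.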

The problem is that in this case, if I kill off SNMP the child process
checks STILL execute and return a CRITICAL.  As a result, Nagios fires
off the event handler for all these checks, which SSHes out to the
nodes in question and restarts a bunch of services that are
probably still running.  It SHOULD NOT schedule the child checks, and
thus not run their event handlers, until AFTER a new parent check has
executed and returned successfully, correct?

I've included a dependency example below, and a snip from the nagios log 
showing it sequentially hammering out checks of all the child processes 
at the same time it already knows the parent is dead.

My apologies for the lengthy post, but I believe I've covered this from
every angle and posted enough info up front to make it easily parseable.
Thanks for any help in this, even if it's just a statement that I'm
wrong, and I have to do this a different way.

Cheers,

/eli

###################################################
### snip of this host/group definition include:
define host{
         use                     linux-node-production
         host_name               HOSTNAME1
         address                 IP
}

define servicedependency{
         host_name                       HOSTNAME1
         service_description             SSH
         dependent_host_name             HOSTNAME1
         dependent_service_description   SNMP
         execution_failure_criteria      w,p,u,c
         notification_failure_criteria   w,p,u,c
         inherits_parent                 1
}

define servicedependency{
         host_name                       HOSTNAME1
         service_description             SNMP
         dependent_host_name             HOSTNAME1
         dependent_service_description   SNMP--*
         execution_failure_criteria      w,p,u,c
         notification_failure_criteria   w,p,u,c
         inherits_parent                 1
}

define service{
         use                             generic-service
         hostgroup_name                  HOSTGROUP1
         service_description             SNMP
         check_command                   SNMPCHECKCOMMAND
         event_handler                   restart-by-ssh!/etc/init.d/snmpd!restart
         normal_check_interval           30
         }

define service{
         use                             generic-service
         hostgroup_name                  HOSTGROUP1
         service_description             SNMP-- NTP running
         check_command                   SNMPCHECKCOMMAND
         event_handler                   restart-by-ssh!/etc/init.d/xntpd!restart
         normal_check_interval           240
         }
###################################################
[1134102595] SERVICE ALERT: HOSTNAME1001;SNMP-- cron running;CRITICAL;SOFT;1;No process matching cron found : CRITICAL
[1134102595] SERVICE EVENT HANDLER: HOSTNAME1001;SNMP-- cron running;CRITICAL;SOFT;1;restart-by-ssh!/etc/init.d/cron!restart
[1134102655] SERVICE ALERT: HOSTNAME1001;SNMP-- cron running;CRITICAL;SOFT;2;No process matching cron found : CRITICAL
[1134102655] SERVICE EVENT HANDLER: HOSTNAME1001;SNMP-- cron running;CRITICAL;SOFT;2;restart-by-ssh!/etc/init.d/cron!restart
[1134102715] SERVICE ALERT: HOSTNAME1001;SNMP-- cron running;CRITICAL;SOFT;3;No process matching cron found : CRITICAL
[1134102715] SERVICE EVENT HANDLER: HOSTNAME1001;SNMP-- cron running;CRITICAL;SOFT;3;restart-by-ssh!/etc/init.d/cron!restart
[1134102775] SERVICE ALERT: HOSTNAME1001;SNMP-- cron running;OK;SOFT;4;(No output returned from plugin)
[1134102775] SERVICE EVENT HANDLER: HOSTNAME1001;SNMP-- cron running;OK;SOFT;4;restart-by-ssh!/etc/init.d/cron!restart
[1134104099] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;HOSTNAME1001;SNMP-- Ganglia running;1134104073
[1134104476] SERVICE ALERT: HOSTNAME1001;SNMP-- Ganglia running;UNKNOWN;SOFT;1;ERROR: Process name table : No response from remote host '10.65.29.1'.
[1134104476] SERVICE EVENT HANDLER: HOSTNAME1001;SNMP-- Ganglia running;UNKNOWN;SOFT;1;restart-by-ssh!/etc/init.d/gmond!restart
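
For anyone who wants to pull the event handler runs out of a log like the one above, here is a small Python sketch. It assumes each log entry sits on a single line (mail clients may wrap them) and that the fields are the semicolon-separated ones visible in the snippet; `LINE_RE` and `parse_event_handler` are names invented for this example.

```python
import re

# Matches "[epoch] SERVICE EVENT HANDLER: host;service;state;type;attempt;command"
# as it appears in the log excerpt; the field layout is inferred from that excerpt.
LINE_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE EVENT HANDLER: "
    r"(?P<host>[^;]*);(?P<service>[^;]*);(?P<state>[^;]*);"
    r"(?P<state_type>[^;]*);(?P<attempt>\d+);(?P<command>.*)$"
)


def parse_event_handler(line):
    """Return a dict of fields for an event handler entry, or None otherwise."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["ts"] = int(record["ts"])
    record["attempt"] = int(record["attempt"])
    return record


if __name__ == "__main__":
    sample = (
        "[1134104476] SERVICE EVENT HANDLER: HOSTNAME1001;"
        "SNMP-- Ganglia running;UNKNOWN;SOFT;1;"
        "restart-by-ssh!/etc/init.d/gmond!restart"
    )
    print(parse_event_handler(sample))
```

Filtering the parsed records on `state_type == "SOFT"` and `attempt == 1` shows how early the handlers fire relative to the parent's state.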







