[patch] Workaround for 'Host DOWN' false-positives

Bruce Campbell nagios-devel at vicious.dropbear.id.au
Sun May 21 13:26:49 CEST 2006


On Sat, 20 May 2006, Jan Kratochvil wrote:

> script for using service "Connectivity" to detect HOST-DOWN/UP states:
>
> Attached script delegates the host-alive checking to the standard Nagios
> services checking and if the service check will detected after some time

Great idea.  Some serious scaling issues in the way you've gone about it 
though (which, for those who didn't follow the explanation, involves the 
host check command scanning the status.dat file for the latest result of 
the 'Connectivity' or 'SSH' service check).

To be more precise, you are reading in the complete status.dat and 
objects.cache files each time this script is being run.  Some 
installations have these files over the 10 meg mark, and I suspect that 
reading in the file each time a host check is run might well be a little 
bit noticeable, particularly when using an interpreted language and a lot 
of hosts.

Rather than having your host check command do a lot of work and possibly 
hit the memory, disk and cpu too hard due to Nagios' periodic obsession 
with repeatedly checking the status of the host, get the service check 
command to do just a little extra bit of work, and submit the host check 
result to Nagios when it runs, leaving the host check to simply do a 
lookup inside Nagios.

( Note, all of the following has been quickly typed up after the influence
   of a rather late night.  I could be completely and utterly wrong )

For instance, try this script as a service check:

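 	#!/bin/sh
 	# host_check_wrapper.sh
 	# Call with: host_check_wrapper.sh $HOSTNAME$ $COMMANDFILE$ $USER1$/normal_check_command args
 	shost=$1
 	shift
 	ncmdf=$1
 	shift

 	# Run the remaining command and record the output text.
 	result=`"$@"`
 	# Record the exit code.
 	state=$?

 	# Map the service state onto a host state.  PROCESS_HOST_CHECK_RESULT
 	# expects 0 (UP), 1 (DOWN) or 2 (UNREACHABLE), not plugin return codes.
 	case $state in
 		0|1)	hstate=0 ;;	# OK or WARNING: the host answered
 		2)	hstate=1 ;;	# CRITICAL: treat the host as DOWN
 		*)	hstate= ;;	# UNKNOWN: submit nothing
 	esac

 	# Submit the result to the Nagios (external) command file
 	if [ -n "$hstate" ] && [ -p "$ncmdf" ] && [ -w "$ncmdf" ] ; then
 		echo "[`date +%s`] PROCESS_HOST_CHECK_RESULT;$shost;$hstate;$result" > "$ncmdf"
 	fi

 	# Return the result to Nagios.
 	echo "$result"
 	exit $state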
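
For checking the format by hand, here is a small sketch of the line the
wrapper writes to the command file.  The function name format_host_result
and the sample values are made up for illustration; only the
PROCESS_HOST_CHECK_RESULT layout comes from the script above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): build the external command
# line for a passive host check result.
# Arguments: <epoch seconds> <host name> <host state> <plugin output>
format_host_result() {
	printf '[%s] PROCESS_HOST_CHECK_RESULT;%s;%s;%s\n' "$1" "$2" "$3" "$4"
}

# Sample values only; the wrapper uses `date +%s` for the timestamp.
format_host_result 1148210809 foo.example.com 0 "PING OK - Packet loss = 0%"
```

Redirecting that output into the command file is what hands the result
to Nagios.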

And the Nagios definitions would be:

 	define command {
 		command_name host_check_wrapper
 		command_line $USER1$/host_check_wrapper.sh $HOSTNAME$ $COMMANDFILE$ $ARG1$ -H $HOSTADDRESS$ $ARG2$ $ARG3$ $ARG4$ $ARG5$ $ARG6$
 		}

 	# Run the service check fairly frequently.
 	define service {
 		host_name foo.example.com
 		service_description Connectivity
 		check_command host_check_wrapper!$USER1$/check_ping!-w!100.0,20%!-c!500.0,60%
 		normal_check_interval 2
 		etc...
 	}

And finally, define your host as follows: do not perform active checking, 
accept passive results, check the freshness of results such that anything 
within the last 20 minutes is valid, and define a fallback command:

 	define host {
 		host_name		foo.example.com
 		address			1.2.3.4
 		active_checks_enabled	0
 		passive_checks_enabled	1
 		check_freshness		1
 		max_check_attempts	5
 		check_interval		2
 		freshness_threshold	1200
 		check_command		check_dummy!2!Host assumed unreachable
 		}

Define the check_dummy command.  check_dummy is a plugin that comes 
standard with Nagios.  It simply exits with the integer given as the first 
argument and prints the text given as the second argument.  In this set-up, 
we're using it to issue an alert if no passive check result has been 
received for the host in 20 minutes.

 	define command {
 		command_name		check_dummy
 		command_line		$USER1$/check_dummy $ARG1$ "$ARG2$"
 		}
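
As a rough picture of what the freshness settings amount to, the sketch
below compares the age of the last passive result with the threshold.
This is only an illustration of the idea, not how Nagios implements it
internally; is_stale and the timestamps are invented for the example.

```shell
#!/bin/sh
# Illustration only: is_stale is a made-up helper, not a Nagios command.
# Arguments: <last result, epoch seconds> <now, epoch seconds> <threshold, seconds>
# Succeeds (exit 0) when the last result is older than the threshold.
is_stale() {
	[ $(( $2 - $1 )) -gt "$3" ]
}

threshold=1200	# freshness_threshold from the host definition: 20 minutes

# A result 5 minutes old is still fresh; one 25 minutes old is stale,
# at which point Nagios falls back to the host's check_command.
if is_stale 1000 1300 "$threshold"; then echo stale; else echo fresh; fi
if is_stale 1000 2500 "$threshold"; then echo stale; else echo fresh; fi
```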

-- 
   Bruce Campbell

