[patch] Workaround for 'Host DOWN' false-positives

Jan Kratochvil lace at jankratochvil.net
Sat May 20 18:17:35 CEST 2006


Hi,

Attached is a script that uses the "Connectivity" service to detect HOST
DOWN/UP states.

I was getting many false-positive host-down alerts. This has been reported here
many times, but usually no real cause was found. For example (I am only
guessing these are the same problem):
	http://sourceforge.net/mailarchive/message.php?msg_id=8739666
	http://sourceforge.net/mailarchive/message.php?msg_id=1758384

Using the attached debugging script "check-host-alive-debug" I found that the
host occasionally really is unreachable, but only for short periods of under
1 minute. This is normal with Internet routing, and it annoys me to be alerted
for it.

Unfortunately, Nagios suspends its other operations and checks that single host
host->max_check_attempts times in a row, without any delay between the
attempts, to determine whether it is alive.

I tried delaying all the checks (and later just the failed ones), but either
the sensitivity was still too high (failures that were too short were still
reported), or, when several hosts really were down for a long time, the total
time Nagios spent blocked crowded out the service checks completely.

The attached script delegates the host-alive checking to the standard Nagios
service checks:
define service {
	service_description	Connectivity
	max_check_attempts	20
	normal_check_interval	5
	retry_check_interval	5
	...
}
Only when the service check has determined, after some time, that the SERVICE
is really down is the HOST declared DOWN, and then immediately:
define host {
	check_command		check-host-alive
	max_check_attempts	1
	...
}
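
To give a feeling for how long "some time" is with the service settings above,
here is a rough sketch of the arithmetic. It assumes the usual Nagios retry
logic (the first failure counts as attempt 1, then one retry per
retry_check_interval until max_check_attempts is reached) and an
interval_length of 60 seconds, which is the Nagios default but is not stated
in this mail. The function name is only illustrative.

```shell
# Hypothetical helper: seconds from the first failed check to a HARD state.
hard_state_delay_seconds() {
	# $1 = max_check_attempts, $2 = retry_check_interval, $3 = interval_length
	echo $(( ($1 - 1) * $2 * $3 ))
}

# 20 and 5 come from the service definition above; interval_length=60 is an
# assumption (Nagios default), not something taken from this mail.
delay="$(hard_state_delay_seconds 20 5 60)"
echo "hard state after ${delay}s ($((delay / 60)) minutes)"
```

Under those assumptions the service needs 19 retries, i.e. 95 minutes, before
it goes hard and the host check starts reporting DOWN.
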
All the services have explicit dependencies defined such as:
define servicedependency {
	host_name			SAME-HOSTNAME
	dependent_host_name		SAME-HOSTNAME
	service_description		Connectivity
	dependent_service_description	Total Processes
	execution_failure_criteria	u,c,p
	notification_failure_criteria	u,c,p
}
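
Since every service on the host needs such a block, they can be generated
instead of written by hand. A minimal sketch, assuming you keep the list of
dependent services yourself; the function name, the host name "web1" and the
service "Disk Space" are made up for the example:

```shell
# Hypothetical generator: one servicedependency block per dependent service.
emit_dependencies() {
	# $1 = host name, $2 = master service, remaining args = dependent services
	local host="$1" master="$2" dep
	shift 2
	for dep in "$@"; do
		cat <<EOF
define servicedependency {
	host_name			$host
	dependent_host_name		$host
	service_description		$master
	dependent_service_description	$dep
	execution_failure_criteria	u,c,p
	notification_failure_criteria	u,c,p
}
EOF
	done
}

# Example: two services on "web1" depending on its "Connectivity" service.
emit_dependencies web1 Connectivity "Total Processes" "Disk Space"
```

The output can be redirected into a file that Nagios reads as part of its
object configuration.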

You can also use the "SSH" service instead of "Connectivity"; the script tries
it second. The script has some trivial hardcoded pathnames, so check them
yourself. The real solution would be to fix the Nagios scheduling, but this way
was easier for me.
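
For those who do not want to read the Perl: the core of the attached script is
a lookup of the service's last hard state in status.dat. The following is only
a simplified sketch of that lookup in shell with awk, not the attached script
itself. The function name and the sample data are made up, and the file layout
("service {", key=value lines, "}") is assumed from what the Perl parser
accepts.

```shell
# Hypothetical helper: print last_hard_state of one service from a
# status.dat-style file; exit status 3 if the service is not found.
service_hard_state() {
	# $1 = status file, $2 = host_name, $3 = service_description
	awk -v host="$2" -v svc="$3" '
		/^[ \t]*service[ \t]*\{[ \t]*$/ { in_svc = 1; h = ""; s = ""; st = ""; next }
		/^[ \t]*\}[ \t]*$/ {
			if (in_svc && h == host && s == svc && st != "") { print st; found = 1; exit }
			in_svc = 0; next
		}
		in_svc {
			line = $0
			sub(/^[ \t]+/, "", line)
			eq = index(line, "=")
			if (eq == 0) next
			key = substr(line, 1, eq - 1); val = substr(line, eq + 1)
			if (key == "host_name") h = val
			else if (key == "service_description") s = val
			else if (key == "last_hard_state") st = val
		}
		END { exit (found ? 0 : 3) }
	' "$1"
}

# Made-up sample: the service is currently failing (soft), last hard state OK.
sample="$(mktemp)"
cat >"$sample" <<'EOF'
host {
	host_name=web1
	current_state=0
	}

service {
	host_name=web1
	service_description=Connectivity
	current_state=2
	last_hard_state=0
	}
EOF

state="$(service_hard_state "$sample" web1 Connectivity)"
echo "Connectivity hard state for web1: $state"
rm -f "$sample"
```

In the sample the service is currently failing but its last hard state is
still 0, so a host check based on it would keep reporting the host as UP.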

Development was paid for courtesy of JK Labs s.r.o.


Regards,
Jan Kratochvil
-------------- next part --------------
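#! /usr/bin/perl
# check-host-alive replacement: instead of probing the host, copy the last
# hard state of its "Connectivity" (or, failing that, "SSH") service from
# the Nagios status.dat and exit with it.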
use strict;
use warnings;

my %dat;
my %cache;

sub fetch
{
	my $open;

	local *DAT;
	do { open DAT,$_ or die "Open \"$_\": $!"; } for $ENV{"HOME"}."/nagios/var/log/nagios/status.dat";
	local $_;
	while (<DAT>) {
		next if /^\s*#/;
		next if /^\s*$/;
		if (/^(\w+)\s+{\s*$/) {
			die "Already open: $open" if $open;
			$open=$1;
			next;
			}
		if (/^\s*}\s*$/) {
			die "Nothing open" if !$open;
			$open=undef();
			next;
			}
		if (/^\s*(\S+)\s*=\s*(.*?)\s*$/) {
			my($left,$right)=($1,$2);
			die "Nothing open" if !$open;
			if ($open eq "host" || $open eq "service") {
				$open.="::$right" if $left eq "host_name";
				}
			if ($open=~/^service::[^:]+$/) {
				$open.="::$right" if $left eq "service_description";
				}
			next if $open=~/^service::[^:]+$/;
			die "Redefined: ${open}::$left" if exists $dat{$open}{$left};
			$dat{$open}{$left}=$right;
			next;
			}
		die "Unknown line: $_";
		}
	close DAT or die "Close: $!";
	die "Stale open" if $open;

	local *CACHE;
	do { open CACHE,$_ or die "Open \"$_\": $!"; } for $ENV{"HOME"}."/nagios/var/log/nagios/objects.cache";
	local $_;
	while (<CACHE>) {
		next if /^\s*#/;
		next if /^\s*$/;
		if (/^define\s+(\w+)\s+{\s*$/) {
			die "Already open: $open" if $open;
			$open=$1;
			next;
			}
		if (/^\s*}\s*$/) {
			die "Nothing open" if !$open;
			$open=undef();
			next;
			}
		if (/^\s*(\w+)\t(\S.*?)\s*$/) {
			my($left,$right)=($1,$2);
			die "Nothing open" if !$open;
			next if $open!~/^host\b/ && $open!~/^service\b/;
			if ($open eq "host" || $open eq "service") {
				$open.="::$right" if $left eq "host_name";
				}
			if ($open eq "service") {
				$open.="::$right" if $left eq "service_description";
				}
			next if $open=~/^service::[^:]+$/;
			die "Redefined: ${open}::$left" if exists $cache{$open}{$left};
			$cache{$open}{$left}=$right;
			next;
			}
		die "Unknown line: $_";
		}
	close CACHE or die "Close: $!";
	die "Stale open" if $open;
}

fetch();

#use Data::Dumper;
#print Dumper(\%dat);
#print Dumper(\%cache);

my %ip_to_hostname;
my %proxy_ip_to_hostname;
while (my($key,$val)=each(%cache)) {
	next if $key!~/^host::([^:]+)$/;
	my $hostname=$1;
	next if !$val->{"address"};
	$ip_to_hostname{$val->{"address"}}=$hostname;
	if (my $parent_hostname=$val->{"parents"}) {
		my $parent_record=$cache{"host::$parent_hostname"}
				or die "No record for: $parent_hostname for: $hostname";
		my $parent_ip=$parent_record->{"address"}
				or die "No address for: $parent_hostname for: $hostname";
		$proxy_ip_to_hostname{$parent_ip}{$val->{"address"}}=$hostname;
		}
	}
#print Dumper(\%ip_to_hostname);
#print Dumper(\%proxy_ip_to_hostname);

die "Expecting -H <IP> [--proxy <IP>] [-p <port>] [...]" if @ARGV<2 || shift ne "-H";
my $ip=shift;
my $proxy;
do { shift; $proxy=shift;                          } if $ARGV[0] && $ARGV[0] eq "--proxy";
do { shift; ($proxy ? $proxy : $ip).=" -p ".shift; } if $ARGV[0] && $ARGV[0] eq "-p";

my $hostname;
if (!$proxy) {
	die "Unknown ip: $ip" if !($hostname=$ip_to_hostname{$ip});
	}
else {
	die "Unknown ip: $ip over proxy: $proxy" if !($hostname=$proxy_ip_to_hostname{$proxy}{$ip});
	}
my $state;
my $state_service;
for (qw(Connectivity SSH)) {
	# Use the hard state; "current_state" would also react to soft failures.
	next if !defined(($state=$dat{"service::${hostname}::$_"}{"last_hard_state"}));
	$state_service=$_;
	last;
	}
die "No state for: $hostname" if !defined $state;
die "Weird state: $state" if $state!~/^[012]$/;

print "State $state_service $state copy for IP $ip hostname $hostname\n";

exit $state;
-------------- next part --------------
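#! /bin/bash
# Debugging wrapper: runs the original check-host-alive plugin and logs its
# start time, end time, exit status, arguments, stdout and stderr.
start="$(date --rfc-3339=seconds)"
# Private temporary directory instead of predictable /tmp names.
t="$(mktemp -d /tmp/check-host-alive-tmp.XXXXXX)" || exit 3
trap 'rm -rf "$t"' EXIT
/home/jklabs/nagios/libexec/check-host-alive-orig "$@" >"$t/1" 2>"$t/2"
rc=$?
echo >>/tmp/check-host-alive.log "$start: $(date --rfc-3339=seconds): rc=$rc $* 1{$(tr '\n' '|' <"$t/1")} 2{$(tr '\n' '|' <"$t/2")}"
cat "$t/1"
cat >&2 "$t/2"
exit $rc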

