nrpe check times out on one host, works perfectly on another.

Thomas Johnson tom at claimlynx.com
Fri Sep 17 19:45:29 CEST 2010


Hello,

I have got a peculiar issue with nagios that I have been trying to
track down. I have an nrpe check configured to run a perl script on a
pair of firewalls (fwiw, it monitors the state table utilization for
pf). Nagios has no problems checking this script on the first
firewall, but it always times out on the check for the second
firewall. The firewalls are configured [nearly] identically, and if I
run the check_nrpe2 command manually on the nagios host (both as root
and as the nagios user), both hosts respond as expected.

nagios at calvin:~-> /usr/local/libexec/nagios/check_nrpe2 -H foobar-1-dev.claimlynx.com -c check_pf_states
2187 (%1.0935)
nagios at calvin:~-> /usr/local/libexec/nagios/check_nrpe2 -H foobar-2-dev.claimlynx.com -c check_pf_states
2383 (%1.1915)

I have restarted nagios and nrpe2 numerous times, to no effect.
Running nrpe with debugging enabled on foobar-2 (the problem host)
never shows any connection from the nagios host for the
check_pf_states check, although other nrpe checks on this host are
working without any issue. I have also confirmed with tcpdump that
there are neither uncompleted connection attempts nor any connections
that request this check.

The relevant parts of my nagios config are as follows:

define service{
        host_name                       foobar-1, foobar-2
        service_description             Check pf State tables
        use                             generic-service
        check_command                   rcheck_pf_states
        contact_groups                  clx-admins-email-24x7
}

define command{
        command_name                    rcheck_pf_states
        command_line                    $USER1$/check_nrpe2 -H $HOSTNAME$ -c check_pf_states
}

define host {
        host_name               foobar-1
        alias                   Firewall Primary (Dev Interface)
        address                 foobar-1-dev.claimlynx.com
        use                     new-generic-host
        parents                 sw-dev
}

define host {
        host_name               foobar-2
        alias                   Firewall Secondary (Dev Interface)
        address                 foobar-2-dev.claimlynx.com
        use                     new-generic-host
        parents                 sw-dev
}
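
For comparison with the manual runs above, here is a minimal sketch of how the command_line template is filled in for each host. It relies on the standard Nagios macro behaviour: $HOSTNAME$ is replaced by the host_name value and $HOSTADDRESS$ by the address value. The helper expand_macros is hypothetical and written only for this illustration, and $USER1$ is left out of the template for brevity. Note that the manual tests passed the fully qualified names to -H, while this template passes the short host_name; whether that difference matters for the timeout is not established here.

```perl
use strict;
use warnings;

# Values copied from the two host definitions above.
my %host = (
    'foobar-1' => { address => 'foobar-1-dev.claimlynx.com' },
    'foobar-2' => { address => 'foobar-2-dev.claimlynx.com' },
);

# Hypothetical helper (not part of Nagios): substitute the two standard
# host macros into a command_line template for the given host_name.
sub expand_macros {
    my ( $template, $host_name ) = @_;
    my %macro = (
        HOSTNAME    => $host_name,
        HOSTADDRESS => $host{$host_name}{address},
    );
    ( my $line = $template ) =~ s/\$(HOSTNAME|HOSTADDRESS)\$/$macro{$1}/g;
    return $line;
}

# Template as configured (uses $HOSTNAME$), with $USER1$ omitted.
my $template = 'check_nrpe2 -H $HOSTNAME$ -c check_pf_states';

# Print the command line that would be run for each host.
print expand_macros( $template, $_ ), "\n" for sort keys %host;
```

Replacing $HOSTNAME$ with $HOSTADDRESS$ in the template makes the sketch print the same arguments that were used in the manual runs.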

The nrpe2 command definition is as follows. It is worth noting that
both firewalls read a shared nrpe config file (to prevent fat fingers
from getting in the way).

command[check_pf_states]=sudo /usr/local/libexec/nagios/check_pf_states

Here is my check script (try not to laugh too hard):

#!/usr/local/bin/perl

# This script checks the number of firewall states currently in use by pf
# and triggers an alert if the number of values exceeds a certain percentage
# of the maximum allowed states.

use strict;
use warnings;
use 5.010;
use lib "/usr/local/libexec/nagios";
use utils qw(%ERRORS);

my $debug;

# Define variables that will be used throughout the script
my $state_limit;	# Will contain the max size of the state table
my $current_states;	# Will contain the current size of the state table
my $adaptive_start;	# Will contain the state count that starts adaptive timeouts

# Get the current state table limit
my @output = `pfctl -s memory 2>/dev/null | grep '^states'`;
say "DEBUG: Found " . @output . " lines while grepping for state table size." if $debug;
for (@output) {
	if ( m/^states\s+hard limit\s+(\d+)/ ) {
		$state_limit = $1;
		say "DEBUG: State table size: $state_limit" if $debug;
		last;
	}
	say "DEBUG: State table size not found in: $_" if $debug;
}
unless ( defined $state_limit ) { die "State limit not found. Cannot continue.\n"; }

# Get the value of adaptive start to calculate the critical bound
@output = `pfctl -s timeouts 2>/dev/null | grep '^adaptive.start'`;
say "DEBUG: Found " . @output . " lines while grepping for lower adaptive timeout bound." if $debug;
for (@output) {
	if ( m/^adaptive.start\s+(\d+) states/ ) {
		$adaptive_start = $1;
		say "DEBUG: Adaptive timeout lower bound is at $adaptive_start states (" . 100*($adaptive_start/$state_limit) . "%)." if $debug;
		last;
	}
	say "DEBUG: Adaptive timeout lower bound not found in: $_" if $debug;
}
unless ( defined $adaptive_start ) { die "Adaptive timeout lower bound not found. Cannot continue.\n"; }

# Get the current size of the state table
@output = `pfctl -s info 2>/dev/null | grep 'current entries'`;
say "DEBUG: Found " . @output . " lines while grepping for the number of state table entries." if $debug;
for (@output) {
	if ( m/^\s+current entries\s+(\d+)/ ) {
		$current_states = $1;
		say "DEBUG: PF reports $current_states state entries in use." if $debug;
		last;
	}
	say "DEBUG: State entry count not found in : $_" if $debug;
}
unless ( defined $current_states ) { die "Could not determine the current number of pf states. Cannot continue.\n"; }

# Calculate some percentages that we may use more than once
my $p_current	= ($current_states/$state_limit);
# Set warning at 10% below the lower bound for adaptive timeouts
my $p_warn	= ($adaptive_start/$state_limit)-0.1;
my $p_crit	= ($adaptive_start/$state_limit);
say "DEBUG: Utilization stats (States/Warn/Critical): $p_current/$p_warn/$p_crit (x 100%)" if $debug;

# Do something intelligent with the information
if ( $current_states >= $state_limit*$p_crit ) {
	# Critical state
	say "DEBUG: CRITICAL state alarm. $current_states (%" . $p_current*100 . ") states in use." if $debug;
	say "$current_states (%" . $p_current*100 . ")";
	exit $ERRORS{'CRITICAL'}
} elsif ( $current_states >= $state_limit*$p_warn ) {
	# Warn state
	say "DEBUG: WARN state alarm. $current_states (%" . $p_current*100 . ") states in use." if $debug;
	say "$current_states (%" . $p_current*100 . ")";
	# The key exported by utils.pm is 'WARNING'; 'WARN' is undefined there
	exit $ERRORS{'WARNING'}
} else {
	# Normal state
	say "DEBUG: NORMAL status. $current_states (%" . $p_current*100 . ") states in use." if $debug;
	say "$current_states (%" . $p_current*100 . ")";
	exit $ERRORS{'OK'}
}
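
The threshold logic at the end of the script can be sketched on its own, without calling pfctl, which makes it easy to exercise with fixed numbers. This is only an illustration under stated assumptions: classify_states and %EXIT are hypothetical names introduced here, the state limit of 200000 is inferred from the sample output above (2187 states reported as 1.0935%), and the adaptive.start value of 120000 is an assumed figure, not one taken from either firewall. The warning threshold is written as adaptive.start minus one tenth of the limit, which is arithmetically the same as the $p_warn calculation in the script.

```perl
use strict;
use warnings;

# Standard Nagios plugin exit codes (the same values utils.pm exports in %ERRORS).
my %EXIT = ( OK => 0, WARNING => 1, CRITICAL => 2 );

# Hypothetical helper: classify state-table usage.
#   CRITICAL at or above adaptive.start,
#   WARNING  at or above adaptive.start minus 10% of the hard limit,
#   OK       otherwise.
sub classify_states {
    my ( $current, $limit, $adaptive_start ) = @_;
    my $crit_at = $adaptive_start;
    my $warn_at = $adaptive_start - $limit / 10;
    return 'CRITICAL' if $current >= $crit_at;
    return 'WARNING'  if $current >= $warn_at;
    return 'OK';
}

# Limit inferred from the sample output; adaptive.start is an assumption.
my ( $limit, $adaptive_start ) = ( 200_000, 120_000 );

for my $current ( 2187, 110_000, 150_000 ) {
    my $status = classify_states( $current, $limit, $adaptive_start );
    printf "%-8s %d states (%.4f%%), exit code %d\n",
        $status, $current, 100 * $current / $limit, $EXIT{$status};
}
```

Keeping the comparison in a function that returns a status name, and looking up the exit code in one place, avoids mistyping a hash key in one of the branches.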

Thank you,

-- 
Thomas Johnson
ClaimLynx, Inc.
