Trouble with Nag-1.0/ePN/check_by_ssh: check returns UNKNOWN status _in_ Nagios.

Stanley Hopcroft Stanley.Hopcroft at IPAustralia.Gov.AU
Sat Feb 22 10:00:05 CET 2003


Dear Ladies and Gentlemen,

I am writing to seek advice about how to deal with an intermittent
problem with check_by_ssh (a CVS version no more than 10 days old:
check_by_ssh (nagios-plugins 1.3.0-beta2) 1.9).

The context:

 Nagios 1.0/ePN/Perl 5.005_03
 FreeBSD 4.7_RELEASE
 PIII 850 + 256 MB. 192 hosts + 309 active checks.
 Load average <= 0.20. Latency <= 16 secs.

The problem:

 check_by_ssh is connecting to an AIX v4 host to run a privileged
/bin/sh script that checks Oracle database 'connectivity' (probably
using SQL*Plus). The check is coded according to the Nagios plugin
guidelines and its author assures me it only exits with 0 or 2 (no
warning, no unknown). The check is run via sudo on AIX, so the complete
check_by_ssh command is

%/usr/local/nagios/libexec/check_by_ssh -t 60 -H oradev \
    -C '/usr/local/bin/sudo -u netstmq /home/local/netsaint/db_check/db_check 2>/dev/null'
all databases ok
%echo $?
0

services.cfg:

define service{
        use                             generic-service

        host_name                       oradev
        service_description             DB Connectivity
        contact_groups                  oracle-admins
        normal_check_interval           30
        check_command                   check_by_ssh4!60!/usr/local/bin/sudo -u netstmq /home/local/netsaint/db_check/db_check 2>/dev/null
        }

checkcommands.cfg:

# 'check_by_ssh4' command definition
define command{
        command_name    check_by_ssh4
        command_line    $USER1$/check_by_ssh -t $ARG1$ -H $HOSTNAME$ -C '$ARG2$'
#       command_line    $USER1$/check_by_ssh -t $ARG1$ -H pc09011 -C '$ARG2$'
        }


Running the command above from the CLI on the Nagios host, logged in as
the Nagios user, I have never seen it return anything other than OK or
CRITICAL. Yet as I write, Nagios reports that the return code is
UNKNOWN, completely contradicting what I see from the CLI (above).
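Since the problem is intermittent, a single CLI run proves little. A
minimal sketch of how the same command could be run many times from the
CLI, counting each exit status seen, follows. The function name
tally_exit_codes and the run count are illustrative assumptions, not part
of Nagios or the plugins.

```shell
# Hypothetical helper: run a command repeatedly and print each distinct
# exit status together with the number of times it was seen.
tally_exit_codes() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard plugin output; only the exit status matters here.
        "$@" </dev/null >/dev/null 2>&1
        echo "$?"
        i=$((i + 1))
    done | sort -n | uniq -c
}
```

It would be invoked as, for example, tally_exit_codes 50 followed by the
full check_by_ssh command shown earlier. Any status other than 0 or 2 in
the tally would mean the problem can be reproduced outside Nagios.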

In addition, sending Nagios a HUP signal usually triggers an UNKNOWN
state, and a Nagios stop/start cycle is the only way to clear it.

A debugging build of Nagios (configured with --enable-DEBUG3 plus the
other usual configure options) behaves somewhat differently: the rate of
UNKNOWN results is much lower, and a HUP only produces an UNKNOWN from
the first check (it recovers on the first retry).

It also shows

        Found check result for service 'DB Connectivity' on host 'oradev'
                Check Type:    ACTIVE
                Parallelized?: Yes
                Exited OK?:    Yes
                Return Status: 3
                Plugin Output: 'all databases ok'
                Service 'DB Connectivity' on host 'oradev' has changed state since last check!
                Raw Command: check-host-alive
                Processed Command: /usr/local/nagios/libexec/check_ping 10.0.100.10 100 100 5000.0 5000.0 -p 1
        Host Check Result: Host 'oradev' is UP
        Host Check Result: Host 'oradev' is UP
        Raw global service event handler command line: $USER1$/global_svc_handler $TIMET$ $HOSTNAME$ '$SERVICEDESC$' $SERVICESTATE$ $STATETYPE$ '$OUTPUT$'
        Processed global service event handler command line: /usr/local/nagios/libexec/global_svc_handler 1045809853 oradev 'DB Connectivity' UNKNOWN SOFT 'all databases ok'

It seems, then, that the check occasionally returns an UNKNOWN status
(Return Status: 3) even though it exits normally (Exited OK?: Yes). It
is not terminating abnormally, which would have given Nagios an
opportunity to set a default value.
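For reference, the plugin convention maps exit statuses to service
states as 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, which is why
'Return Status: 3' appears as UNKNOWN in the event handler line. A small
sketch of that mapping follows; the function name is illustrative, and
treating every other status as UNKNOWN is an assumption of this sketch.

```shell
# Map a plugin exit status to the Nagios service state name.
# Statuses outside 0-3 are reported as UNKNOWN here; that choice is an
# assumption of this sketch.
nagios_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}
```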

Since check_by_ssh is used by this Nagios successfully for other
services, the only explanation I can think of is that 'sudo' is
malfunctioning in some way. 

Your comments about how to proceed are most welcome: gdb, truss?
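
One lighter-weight option than gdb or truss might be to put a small
logging wrapper between Nagios and check_by_ssh, so that the status the
plugin really exits with under Nagios is recorded independently. The
following is only a sketch: the function name log_and_run and any log
file path are hypothetical.

```shell
# Hypothetical wrapper: run a command, append its exit status and output
# to a log file, then pass both back unchanged to the caller.
log_and_run() {
    logfile=$1
    shift
    # Capture stdout and stderr together, then the command's exit status.
    output=$("$@" 2>&1)
    status=$?
    printf '%s status=%s output=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$status" "$output" >> "$logfile"
    # Re-emit the plugin output and exit status for Nagios.
    printf '%s\n' "$output"
    return "$status"
}
```

Saved as a script that ends by calling log_and_run with a log file and
"$@" and then exiting with the returned status, it could stand in for
check_by_ssh in the command_line definition. If the log shows 0 while
Nagios records 3, the fault lies on the Nagios side rather than with
sudo or the remote script.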

Yours sincerely.


-- 
------------------------------------------------------------------------
Stanley Hopcroft
------------------------------------------------------------------------

'...No man is an island, entire of itself; every man is a piece of the
continent, a part of the main. If a clod be washed away by the sea,
Europe is the less, as well as if a promontory were, as well as if a
manor of thy friend's or of thine own were. Any man's death diminishes
me, because I am involved in mankind; and therefore never send to know
for whom the bell tolls; it tolls for thee...'

from Meditation 17, J Donne.

