NRPE: Unable to read output; but works when run under strace ...

Florian Ernst florian_ernst at gmx.net
Mon Oct 8 20:31:21 CEST 2012


Hello all,

Given a fairly well-running monitoring setup with about 18k services, I
thought I had understood the basics. However, the following leaves me
clueless, and I hope I'm merely missing something obvious here:

On an up-to-date Debian Squeeze (i386) OpenVZ guest I have established
that my monitoring user can execute a given command:

root@vserv08:/# sudo -u monitor -i /usr/lib/nagios/plugins/check_dummy 0 success; echo Exitcode: $?
OK: success
Exitcode: 0

So far, so good. Now enter NRPE, using a stripped-down config to
illustrate the point:

root@vserv08:/# grep -v -e '^$' -e '^#' /etc/nagios/nrpe.cfg
debug=1
nrpe_user=monitor
nrpe_group=monitor
allowed_hosts=127.0.0.1
command[dummy]=/usr/lib/nagios/plugins/check_dummy 0 success

root@vserv08:/# ps auxww | grep '[/]usr/sbin/nrpe'
monitor   7215  0.0  0.1   3704   892 ?        Ss   15:20   0:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -d

The process startup logged as follows:

Oct  8 15:20:22 vserv08 nrpe[7214]: Added command[dummy]=/usr/lib/nagios/plugins/check_dummy 0 success
Oct  8 15:20:22 vserv08 nrpe[7214]: INFO: SSL/TLS initialized. All network traffic will be encrypted.
Oct  8 15:20:22 vserv08 nrpe[7215]: Starting up daemon
Oct  8 15:20:22 vserv08 nrpe[7215]: Listening for connections on port 5666
Oct  8 15:20:22 vserv08 nrpe[7215]: Allowing connections from: 127.0.0.1

However, executing the dummy command won't work:

root@vserv08:/# /usr/lib/nagios/plugins/check_nrpe -H 127.0.0.1 -c dummy
NRPE: Unable to read output

This has been logged as:

Oct  8 15:21:36 vserv08 nrpe[7234]: Connection from 127.0.0.1 port 48791
Oct  8 15:21:36 vserv08 nrpe[7234]: Host address is in allowed_hosts
Oct  8 15:21:36 vserv08 nrpe[7234]: Handling the connection...
Oct  8 15:21:36 vserv08 nrpe[7234]: Host is asking for command 'dummy' to be run...
Oct  8 15:21:36 vserv08 nrpe[7234]: Running command: /usr/lib/nagios/plugins/check_dummy 0 success
Oct  8 15:21:36 vserv08 nrpe[7234]: Command completed with return code 2 and output:
Oct  8 15:21:36 vserv08 nrpe[7234]: Return Code: 2, Output: NRPE: Unable to read output
Oct  8 15:21:36 vserv08 nrpe[7234]: Connection from 127.0.0.1 closed.

This strikes me as weird: nrpe tries to execute the defined command, but
somehow no output shows up. I know of the peculiarities that might arise
once sudo joins the team or when permissions aren't set appropriately,
but this doesn't apply here.
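
To make the symptom concrete, here is a minimal sketch of the behaviour the
log above suggests: the command runs, nothing arrives on stdout, and a
placeholder message with return code 2 is reported instead. This only mimics
what the log shows and is not NRPE's actual code; run_check is a hypothetical
helper name.

```shell
#!/bin/sh
# Illustrative only: mimics the behaviour visible in the log, not NRPE's code.
# run_check is a hypothetical helper name.
run_check() {
    # Run the given command and capture its stdout.
    output=$("$@" 2>/dev/null)
    status=$?
    if [ -z "$output" ]; then
        # Nothing was captured: report the placeholder text and code 2,
        # matching the "return code 2" line in the log.
        echo "NRPE: Unable to read output"
        return 2
    fi
    # Otherwise pass the output and exit status through unchanged.
    printf '%s\n' "$output"
    return "$status"
}
```

Called as `run_check /usr/lib/nagios/plugins/check_dummy 0 success`, this
would pass the plugin's text through when it can be captured, and fall back to
the placeholder when it cannot.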

Playing around with the dummy command (substituting a shell script,
sprinkling '| tee -a logfile' into the code, ...) revealed that the
desired text output is indeed generated but somehow gets discarded. Perhaps
the monitoring user or even the whole system is subtly broken, but given
that there are ~400 similarly set up systems (all using the same
workflow/automation for deploying the monitoring infrastructure) I was
starting to wonder how that might have happened ...
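
For reference, the kind of substitute script I mean can be sketched roughly
as follows. The function name wrap_plugin and the log file location are
made up for illustration; it assumes a POSIX shell and simply records what the
plugin printed and how it exited before handing both back to the caller.

```shell
#!/bin/sh
# Hypothetical debugging wrapper; the name and log path are assumptions.
# Usage: wrap_plugin LOGFILE COMMAND [ARGS...]
wrap_plugin() {
    logfile=$1
    shift
    # Capture stdout and stderr together, then remember the exit status.
    output=$("$@" 2>&1)
    status=$?
    # Append a record to the log so it survives even if the caller drops it.
    printf '%s pid=%s status=%s output=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$$" "$status" "$output" >> "$logfile"
    # Hand the plugin's output and exit status back unchanged.
    printf '%s\n' "$output"
    return "$status"
}
```

Pointing the command[dummy] definition at a script that calls, for example,
`wrap_plugin /tmp/dummy.log /usr/lib/nagios/plugins/check_dummy 0 success`
shows whether the plugin produced its text even when check_nrpe reports none.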

However, it got weirder: if I strace the nrpe process, everything works
as desired:

root@vserv08:/# strace -f -o /root/log -p 7215

And then in another terminal:

root@vserv08:/# /usr/lib/nagios/plugins/check_nrpe -H 127.0.0.1 -c dummy
OK: success

Logged as follows:

Oct  8 15:21:57 vserv08 nrpe[7240]: Connection from 127.0.0.1 port 37275
Oct  8 15:21:57 vserv08 nrpe[7240]: Host address is in allowed_hosts
Oct  8 15:21:57 vserv08 nrpe[7240]: Handling the connection...
Oct  8 15:21:57 vserv08 nrpe[7240]: Host is asking for command 'dummy' to be run...
Oct  8 15:21:57 vserv08 nrpe[7240]: Running command: /usr/lib/nagios/plugins/check_dummy 0 success
Oct  8 15:21:57 vserv08 nrpe[7240]: Command completed with return code 0 and output: OK: success
Oct  8 15:21:57 vserv08 nrpe[7240]: Return Code: 0, Output: OK: success
Oct  8 15:21:57 vserv08 nrpe[7240]: Connection from 127.0.0.1 closed.

I found no further hints in the strace log, but this led me to assume
that there is some NRPE weirdness involved, and thus I'm writing here
instead of further digging through the system.

Any ideas?

Cheers,
Flo
