NRPE inconsistency

Carroll, Jim P [Contractor] jcarro10 at sprintspectrum.com
Fri Oct 25 23:38:26 CEST 2002


Synopsis:  check_nrpe returns inconsistent results at the command line, but
the inconsistency follows a predictable pattern.

Environment:
- Nagios server:  RedHat 7.3
- NRPE client:  Solaris8
- plugin:  check_log2 (Perl version of check_log in contrib directory)

All other NRPE checks are working just fine.  But for 2 of my hosts, Nagios
is reporting problems in /var/adm/messages.

(Just for the sake of reference, the 2 hosts which aren't reporting
correctly are itdmln14 and itdmln15.)

Tests (apologies for the line wrap):

$ ../libexec/check_nrpe itdmln14 -c check_log_err
OK - No matches found.
$ ../libexec/check_nrpe itdmln15 -c check_log_err
(4): Oct 22 12:16:24 itdmln15 sshd[762]: [ID 800047 auth.error] error:
setsockopt SO_KEEPALIVE: Invalid argument
$ ../libexec/check_nrpe itdmln15 -c check_log_err
OK - No matches found.
$ ../libexec/check_nrpe itdmln14 -c check_log_err
(2): Oct 23 20:28:10 itdmln14 nfs: [ID 664466 kern.notice] NFS getattr
failed for server itdmln15: error 5 (RPC: Timed out)
$ ../libexec/check_nrpe itdmln14 -c check_log_err
OK - No matches found.

Snippet from nrpe.cfg which is used on ALL hosts:

command[check_log_err]=/home/nagios/libexec/check_log3 -l /var/adm/messages -s /home/nagios/.messages_err.seek -p "ERR|Err|err|PANIC|Panic|panic" -n "nrpe|uxwdog|sprintnb tldd|sprintnb ltid|sprintnb tldcd"

(I copied check_log2 to check_log3 and modified it to return code 2 instead
of 1, so that it would be treated as critical instead of as a warning.)
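
For reference, the standard Nagios plugin exit codes are 0 (OK), 1 (WARNING),
2 (CRITICAL) and 3 (UNKNOWN).  The change I made to check_log3 amounts to
the remapping sketched below.  This is only an illustration in Python, not
the Perl I actually edited; the names escalate and run_escalated are made up
for the example, and wrapping the plugin instead of copying it is just one
possible alternative.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def escalate(status):
    """Map WARNING to CRITICAL; pass every other status through unchanged."""
    return CRITICAL if status == WARNING else status


def run_escalated(argv):
    """Run a plugin command line, echo its output, return the remapped status.

    Hypothetical helper: argv is the full plugin command line as a list.
    """
    proc = subprocess.run(argv, stdout=subprocess.PIPE, universal_newlines=True)
    sys.stdout.write(proc.stdout)
    return escalate(proc.returncode)
```

A wrapper script would finish with sys.exit(run_escalated(sys.argv[1:])) so
that Nagios sees the remapped status.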

So the essence of the problem is this:

- Testing itdmln14 and then testing it again, the 2nd test will come back
OK.
- Testing itdmln15 and then testing it again, the 2nd test will come back
OK.
- Testing itdmln14 and then testing itdmln15, the 2nd test will come back
with an error snarfed from /var/adm/messages on the target host (no matter
how stale the error is).
- Testing itdmln15 and then testing itdmln14, the 2nd test will come back
with an error snarfed from /var/adm/messages on the target host (no matter
how stale the error is).
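
To help reason about the pattern above, here is a small Python model of how
a seek-file based log check behaves.  It is NOT the check_log3 code.  It
assumes the plugin stores a byte offset in the seek file, reads only what
lies past that offset, and starts over from the top when the stored offset
is larger than the log.  The function names and log contents are invented
for the example.  The demo also assumes one seek file ends up being used
against two different logs, which I have not confirmed on itdmln14 and
itdmln15; it is just one scenario that produces the same alternating
results.

```python
import os
import re
import tempfile


def check_log(log_path, seek_path, pattern):
    """Return (exit_code, message) for lines appended since the saved offset."""
    size = os.path.getsize(log_path)
    offset = 0
    if os.path.exists(seek_path):
        with open(seek_path) as fh:
            offset = int(fh.read().strip() or 0)
    if offset > size:
        # Assumed behaviour: a saved offset past the end of this log (after
        # rotation, or written for some other file) means start from the top.
        offset = 0
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        new_lines = fh.read().decode("utf-8", "replace").splitlines()
    with open(seek_path, "w") as fh:
        fh.write(str(size))
    hits = [line for line in new_lines if re.search(pattern, line)]
    if hits:
        return 2, "(%d): %s" % (len(hits), hits[-1])
    return 0, "OK - No matches found."


def demo():
    """Check two logs of different sizes through ONE seek file (hypothetical)."""
    tmp = tempfile.mkdtemp()
    log_a = os.path.join(tmp, "messages_a")  # stands in for one host's log
    log_b = os.path.join(tmp, "messages_b")  # stands in for the other's log
    shared_seek = os.path.join(tmp, "shared.seek")
    with open(log_a, "w") as fh:
        fh.write("boot ok\nold error on a\n")
    with open(log_b, "w") as fh:
        fh.write("boot ok\nlink up\ncron started\nold error on b\n")
    order = (log_a, log_a, log_b, log_b, log_a)
    return [check_log(log, shared_seek, "error")[0] for log in order]


print(demo())
```

In this model, checking the same log twice in a row comes back OK the second
time, while switching from one log to the other reports an old error again.
With a separate seek file per log, only the very first check of each log
reports anything.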

I've also observed that when I flip back and forth between one of the
problem hosts and another arbitrary host (e.g., itdmln14, itdmln13, itdmln14,
itdmln13), the results come back clean.
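
The back-and-forth tests can be scripted so that the order of hosts is
recorded along with each result.  The sketch below is a hypothetical Python
harness, not something shipped with NRPE.  It uses the same check_nrpe
invocation as my tests above (host first, then -c and the command name);
the function names and the runner parameter are made up for the example.

```python
import subprocess

CHECK_NRPE = "../libexec/check_nrpe"  # path as used in the tests above


def build_command(host, command="check_log_err"):
    """Build the argument list for one check_nrpe call."""
    return [CHECK_NRPE, host, "-c", command]


def _run(argv):
    """Default runner: execute the command, return (exit status, stdout)."""
    proc = subprocess.run(argv, stdout=subprocess.PIPE, universal_newlines=True)
    return proc.returncode, proc.stdout


def run_sequence(hosts, runner=None):
    """Check each host in order; return a list of (host, status, first line)."""
    runner = runner or _run
    results = []
    for host in hosts:
        status, output = runner(build_command(host))
        first = output.splitlines()[0] if output.strip() else ""
        results.append((host, status, first))
    return results
```

For example, run_sequence(["itdmln14", "itdmln15", "itdmln14", "itdmln13"])
would run the four checks in that order and return one tuple per check.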

I'm not sure if this is meaningful, but here's some other info:

- Running NRPE as user 'nagios' on all clients, as a standalone daemon (not
from inetd)
- nrpe.cfg on both 14 and 15 is 12088 bytes long
- nrpe.cfg on 13 is 12418 bytes (i.e., larger than on the hosts in question)
- repeatedly running this command locally on 14 and 15 returns cleanly
each/every time:

/home/nagios/libexec/check_log3 -l /var/adm/messages -s
/home/nagios/.messages_err.seek -p "ERR|Err|err|PANIC|Panic|panic"

I don't have any empirical evidence, but I'm a wee bit suspicious of NRPE
itself (not the plugins) being the culprit.  Perhaps a pointer is way off in
a ditch somewhere.  :-/


Sorry for the long post.  I'm hoping that a fresh set of eyes will be able
to point me in the right direction.

jc

