Sporadic Communication Problems with an NRPE Client

Carroll, Jim P [Contractor] jcarro10 at sprintspectrum.com
Fri Dec 27 16:46:50 CET 2002


It appears you have two problems here.  I can comment on one of them, but
am unsure about the other.

I, too, have suffered from sporadic NRPE problems like the ones you're
describing.  The first time, I resolved it by moving Nagios from a
wheezy, underpowered Intel machine to something quite a bit more up-to-date.
I started experiencing it again as I continued to scale up the number of
hosts I was monitoring, and ended up doing some performance tuning of the
system.  This system is using RLX blade technology (www.rlx.com) with RH
7.2.  A co-worker had configured the blade to use the disk mirroring option
that RLX supplies (could possibly be LVM) and that was just beating up on
the I/O.  So we captured the image and redeployed it without the mirroring.
We also bumped up the swap from 500MB (1/2 of the 1GB RAM) to 2GB (which
might be overkill).  There were other symptoms prior to the tuning; I had
bumped up the number of service checks from about 800 to about 1100, which
was also causing the system to suffer a little too much (response lag
started to skyrocket).

I haven't had any NRPE problems of this nature since tuning the system.
However, I've already expressed my concern to the list in this regard,
pondering aloud whether NSCA is preferable for scaling to much larger
numbers, but alas, I received neither "YES, switch to NSCA" nor "NO, NRPE
will easily scale to [insert really large number here]".

I'm not entirely certain this is the best approach, but I think it would be
preferable if there were a way to tell check_nrpe to ignore connection
timeouts (i.e., not return 2).  It's useful to have at least one NRPE check
return a critical if it cannot connect to the NRPE daemon on the client,
but as I've discovered, the only way to avoid getting flooded with NRPE
alerts (e.g., if the daemon can't or didn't start) is to define
dependencies.  I have a dummy NRPE check which merely does an
"echo OK - NRPE is up", which of course returns 0, and all the other NRPE
checks for that host are dependent on it.  Works like a charm, but adds
significant bulk to dependencies.cfg for each host.
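
Since the per-host entries are repetitive, they could be generated rather
than written by hand.  What follows is only a minimal sketch, not the setup
described above: the helper name, the host name, and the service
descriptions are all hypothetical, and the failure criteria shown (always
execute the dependent check, suppress its notifications while the master
check is in a warning, unknown, or critical state) are an assumption about
what one would want here.

```python
# Hypothetical helper: emit one Nagios servicedependency block per NRPE
# check, each depending on a single "NRPE is up" service on the same host.

TEMPLATE = """define servicedependency{{
    host_name                       {host}
    service_description             {master}
    dependent_host_name             {host}
    dependent_service_description   {dependent}
    execution_failure_criteria      n
    notification_failure_criteria   w,u,c
    }}
"""


def dependency_blocks(host, master, dependents):
    """Return the servicedependency definitions for one host as a string."""
    return "\n".join(
        TEMPLATE.format(host=host, master=master, dependent=name)
        for name in dependents
    )


if __name__ == "__main__":
    # Placeholder names for illustration only; substitute your own.
    print(dependency_blocks("web01", "NRPE Alive", ["Disk /", "Load", "Procs"]))
```

The output can be redirected into dependencies.cfg (or a file it includes)
and regenerated whenever checks are added to a host.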

As for your "status information not found", you might want to check your
logfiles for hints of the problem.

jc

> -----Original Message-----
> From: Kaplan, Andrew H. [mailto:AHKAPLAN at PARTNERS.ORG]
> Sent: Friday, December 27, 2002 8:27 AM
> To: 'nagios-users at lists.sourceforge.net'
> Subject: [Nagios-users] Sporadic Communication Problems with an NRPE
> Client
> 
> 
> I installed the nrpe client on a Red Hat 7.3 machine recently and I am
> encountering the following problem:
> 
> Nagios sporadically returns a critical error message indicating that the
> connection has been refused by the host machine. Additionally, I cannot
> access the service detail screen of the particular host. Every time that
> I try I get an Error: Status information not found message, even though
> the services do appear on the previous web page.
> 
> Has anyone encountered this situation, and if so what have they done to
> correct it? Thanks.
> 
> 
> -------------------------------------------------------
> This sf.net email is sponsored by:ThinkGeek
> Welcome to geek heaven.
> http://thinkgeek.com/sf
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> 

