NRPE way too fragile ?

Matt Rivet matt.rivet at secure-24.com
Thu Oct 9 18:48:41 CEST 2008


I am running into the same issue.  I am monitoring multiple Windows hosts
using NRPE, with NSClient++ on the Windows machines.  I have raised the
timeout to 20 seconds, and this has helped.  Currently I may see 3 or
4 timeouts, usually due to virus scans or backup jobs.

Does anyone know of any tweaks that can be made on the Windows
machines (using NSClient++) to eliminate or at least reduce timeouts?



-----Original Message-----
From: Mark Young [mailto:myoung at nagios.org] 
Sent: Wednesday, October 08, 2008 12:19 PM
To: Jayson Broughton
Cc: nagios-users at lists.sourceforge.net
Subject: Re: [Nagios-users] NRPE way too fragile ?

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


On Oct 8, 2008, at 10:06 AM, Jayson Broughton wrote:

> Guillaume,
>
> We actually have the same problem here where Nagios is set up.  I use the
> NRPE daemon on both Windows and Linux servers.  Two of our servers do
> backups early in the morning, and during that time we get NRPE timeout
> messages from the two servers.  I have set the timeout on the server and the
> clients to 30 seconds, thinking that would fix it.  But alas,
> we still get timeout messages.  I haven't had much time to see what I can do
> to fix this, so for now our solution is to not email out warning messages
> from those two servers (I have the thresholds set so that the critical
> messages still give enough of a window to take care of the problem before it
> becomes too critical).  I have come to the conclusion that the servers are
> running the backup and eating up so much processing power that NRPE times
> out trying to connect and send information.
>
> If you find a solution or even an idea to try, feel free to let me know and
> I'll give it a shot!


NRPE is normally set up with xinetd listening for incoming
connections.  When I say normally I mean if you followed the
documentation. ;)  By default xinetd has a low threshold of
connections per instance in order to lessen the load on the server and
prevent DDoS-type attacks.  You can view what I mentioned in my
previous post here:
http://article.gmane.org/gmane.network.nagios.user/56713

NRPE can be fine until Nagios decides to run many checks against the
same host at the same time, which hits the threshold.  For example,
this happens when you go to the extended host information and choose
"Schedule a check of all services on this host".

If you are not running it under xinetd, there may be other issues,
including system load and/or possible bugs with NRPE on your system.
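
To illustrate the point about bursts, here is a minimal Python sketch of a
per-second connection cap.  It is a simplified model, not xinetd's actual
accounting; the function name seconds_over_limit, the limit of 50, and the
arrival times are all made-up values for illustration.

```python
from collections import Counter


def seconds_over_limit(arrival_times, limit_per_second):
    """Return {second: count} for each second with more arrivals than the limit.

    Simplified model only: real connection limiting is more involved.
    """
    per_second = Counter(int(t) for t in arrival_times)
    return {sec: n for sec, n in sorted(per_second.items()) if n > limit_per_second}


# Hypothetical workload: 60 checks against one host.
burst = [i * 0.01 for i in range(60)]   # all forced within the same second
spread = [float(i) for i in range(60)]  # one check per second

print(seconds_over_limit(burst, limit_per_second=50))
print(seconds_over_limit(spread, limit_per_second=50))
```

In this model the same 60 checks exceed the cap only when they are forced
at once, which is the situation described above.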





>
>> -----Original Message-----
>> From: Guillaume Rousse [mailto:Guillaume.Rousse at inria.fr]
>> Sent: Wednesday, October 08, 2008 4:44 AM
>> To: nagios-users at lists.sourceforge.net
>> Subject: [Nagios-users] NRPE way too fragile ?
>>
>> Hello list.
>>
>> I'm using nrpe quite heavily for testing lots of local services on all my
>> machines. It usually works well, but seems a bit unreliable: too
>> often, nrpe itself fails to accept incoming connections, and the test
>> fails:
>> CHECK_NRPE: Socket timeout after 10 seconds.
>>
>> Stracing the nrpe process shows it is probably itself waiting on another
>> connection:
>> [root at denfert ~]# strace -p 22444
>> Process 22444 attached - interrupt to quit
>> select(6, [5], NULL, [5], {0, 170000})  = 0 (Timeout)
>> accept(5, 0, NULL)                      = -1 EAGAIN (Resource temporarily unavailable)
>>
>> It usually recovers by itself, but that's enough to cause many
>> unwanted notifications, even if all monitored services have nrpe itself
>> as a dependency. I'm using SSL encryption, as usually advised, but I'm
>> planning to shift to plain-text connections (everything occurs on a
>> distinct VLAN, without user access).
>>
>> Does anyone else have a similar experience?
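
For readers unfamiliar with the trace above: a select() that times out
followed by accept() returning EAGAIN is also what a non-blocking listener
produces when no connection is pending.  The Python sketch below reproduces
that sequence on a throwaway loopback socket.  It does not use NRPE at all,
the helper name poll_once is made up, and whether this explains the timeouts
in this thread is an assumption, not a diagnosis.

```python
import select
import socket


def poll_once(listener, timeout):
    """Run one select-then-accept iteration on a non-blocking listener.

    Returns (result, was_readable), where result is "EAGAIN" or "accepted".
    """
    readable, _, _ = select.select([listener], [], [listener], timeout)
    try:
        conn, _addr = listener.accept()
    except BlockingIOError:
        # Nothing was queued: this is the EAGAIN seen in the trace.
        return ("EAGAIN", bool(readable))
    conn.close()
    return ("accepted", bool(readable))


listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(5)
listener.setblocking(False)

idle = poll_once(listener, 0.17)  # no client yet
print(idle)

client = socket.create_connection(listener.getsockname(), timeout=2)
busy = poll_once(listener, 0.17)  # a client is now waiting
print(busy)

client.close()
listener.close()
```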

You seem to be running it as a standalone daemon process.  What system
and NRPE version are you running?  What version of Nagios?  How many
NRPE checks are you trying to perform in a given time?  You may see some
benefit from having Nagios spread its checks out more evenly, though
this simply covers up the underlying problem.  Maybe consider running
it under xinetd and increasing the number of allowed connections per
second as a test.
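
When collecting the details asked for above, it can help to know whether
the NRPE port accepts a TCP connection at all, and how long that takes.
Below is a minimal Python sketch, assuming a hypothetical helper named
probe_tcp.  It only tests the TCP handshake, not the NRPE protocol or SSL,
and the demo uses a throwaway loopback listener rather than a real NRPE host.

```python
import socket
import time


def probe_tcp(host, port, timeout):
    """Attempt a plain TCP connect; return (connected, seconds_elapsed)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections and timeouts alike.
        return False, time.monotonic() - start


# Demo against a local listener so the sketch runs anywhere.
demo = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
demo.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
demo.listen(1)

ok, elapsed = probe_tcp("127.0.0.1", demo.getsockname()[1], timeout=2.0)
print(ok, round(elapsed, 3))

demo.close()
```

Against a real host you would pass its address and the NRPE port (5666 by
default), for example probe_tcp("192.0.2.10", 5666, timeout=10.0), where
192.0.2.10 is a placeholder address.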


Mark Young
___
Nagios Enterprises, LLC
Web:    www.nagios.com
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.8 (Darwin)

iEYEARECAAYFAkjs3WoACgkQ0KipU7WwlaWtXgCgs+77/o8Pyh0t/++FIbOEycgx
oiAAoMj2awwRG4HCernz7pcdf/K484Ca
=oKG0
-----END PGP SIGNATURE-----

------------------------------------------------------------------------
-
This SF.Net email is sponsored by the Moblin Your Move Developer's
challenge
Build the coolest Linux based applications with Moblin SDK & win great
prizes
Grand prize is a trip for two to an Open Source event anywhere in the
world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when
reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null





