Small patch for check_nrpe.c

Andreas Ericsson ae at op5.se
Fri Sep 1 14:39:34 CEST 2006


Mark Plaksin wrote:
> Andreas Ericsson <ae at op5.se> writes:
> 
>> Mark Plaksin wrote:
>>
> 
> Sorry if I wasn't clear about which end was which.  "Server" means box
> running Nagios and client is the ... client to which Nagios connects.  On
> our client, connections remained in the LAST_ACK state for over 3 minutes.
> There is probably an ndd parameter which could be tuned but I have not
> found it yet.
> 
>>> That's not so bad in itself but we were unlucky enough to have our server
>>> (Debian stable box running a 2.6 kernel) attempt a new connection to the
>>> same client using the same source port!   The client thought it was already
>>> talking to the server on that port so it didn't play along and
>>> check_nrpe_ssl on the server timed out.
>>>
>> I'm a bit confused about your terms here. Is the "client" the host 
>> running Nagios? It would seem so above, since you say the server end 
>> ungracefully closes the connection. Closing the socket is done by 
>> check_nrpe after it has received all the data from nrpe, so if it goes 
>> the other way around you've got something funny going on indeed.
>>
>> Otoh, if the "client" really is the host running Nagios, your second 
>> statement doesn't make much sense, as check_nrpe initiates the 
>> connection too.
>>
>> Either way, check_nrpe will always connect to the same port, and no port 
>> can ever be used twice for connecting to somewhere else (check 
>> linux/net/ipv4/inet_connection_sock.c, the function inet_csk_get_port).
> 
> Maybe this description will make more sense:
> 
> 1)  Nagios box (Debian) makes a connection to client with a randomly selected source
>     port of, say, 30000 and a destination port of 5666 (the NRPE port) on
>     the client (HP-UX).
> 2)  A normal conversation occurs and the Nagios end sends RST (a result of
>     close(sd)) to the client.
> 3)  Nagios box believes the connection is over and moves on.  *Something*
>     makes the client box believe the connection is still alive and it
>     continues to believe this for several minutes.  'netstat -a' shows the
>     connection in the LAST_ACK state all this time.  
> 
>     Jay's best guess is that the NAT device caused the trouble--it saw
>     enough to think the conversation was over and stopped sending packets.

It's possible, but the NAT device would have had to fail to forward the 
RST from Nagios -> other_host for the other machine to think the 
connection was still up. If, on the other hand, the RST packet simply got 
lost (busy line, perhaps?), then you'd get exactly this situation.

Did you do the packet-trace between NAT -> HPUX and Nagios -> NAT at the 
same time, or only on one side?
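
For what it's worth, the usual way to keep close() from showing up as an 
RST on the wire is to shut down the sending side and drain whatever is 
still queued before closing. A rough sketch of that kind of teardown in 
C -- purely an illustration against an assumed connected socket 
descriptor sd, not the actual patch:

/* Sketch only: tell the peer we're done sending, drain anything still
 * queued or in flight, then close.  Assumes a connected, blocking
 * socket descriptor; error handling kept to a minimum. */
#include <unistd.h>
#include <sys/socket.h>

static void graceful_close(int sd)
{
	char buf[1024];

	shutdown(sd, SHUT_WR);          /* send our FIN, keep reading */

	/* read until the peer's FIN arrives; unread data left in the
	 * receive queue at close() is what typically provokes an RST */
	while (read(sd, buf, sizeof(buf)) > 0)
		;

	close(sd);
}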

>     And the HP-UX box (client) wanted to see more before deciding the
>     connection was over.
> 4)  Nagios box makes a brand new connection to the same client machine and
>     by great coincidence uses the same source port--30000.
> 5)  The client box believes the two machines are already talking on this
>     port and doesn't allow the new connection to succeed.  The Nagios box
>     eventually gives up and check_nrpe says "timeout".
> 

That made it crystal clear :)

The really odd thing is that, per the TCP RFC (793), the maximum segment 
lifetime is 120 seconds, which means that either the NAT device in 
between kept the connection alive for some seriously odd reason, or the 
HP-UX kernel is buggy/non-RFC-compliant.

Btw, Linux picks ephemeral source ports round-robin from a fairly narrow 
range (32768-61000 by default; see /proc/sys/net/ipv4/ip_local_port_range), 
so on a system that makes that many outbound connection attempts in the 
interval between two checks of the failing system, you'll end up in the 
rough neighbourhood of the same port number. Some checks initiate more 
than one connection, so for a busy Nagios server this isn't an unlikely 
scenario.
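
If you want to see which source port the kernel actually hands out (and 
how soon it comes back around), a stand-alone snippet along these lines 
will show it -- purely illustrative, not part of nrpe, and the address 
below is a placeholder:

/* Illustration only: make one outbound connection and print the
 * ephemeral source port the kernel picked.  192.0.2.10 is a
 * placeholder address; 5666 is the usual nrpe port. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
	struct sockaddr_in peer, local;
	socklen_t len = sizeof(local);
	int sd = socket(AF_INET, SOCK_STREAM, 0);

	memset(&peer, 0, sizeof(peer));
	peer.sin_family = AF_INET;
	peer.sin_port = htons(5666);
	inet_pton(AF_INET, "192.0.2.10", &peer.sin_addr);

	if (sd < 0 || connect(sd, (struct sockaddr *)&peer, sizeof(peer)) < 0) {
		perror("connect");
		return 1;
	}

	getsockname(sd, (struct sockaddr *)&local, &len);
	printf("kernel chose source port %u\n", (unsigned)ntohs(local.sin_port));
	close(sd);
	return 0;
}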

>> This patch doesn't hurt anything, although I'm still curious as to why 
>> it's needed, since it seems like there's some horribly odd bug somewhere 
>> in the dark recesses of one system's kernel, or your network setup 
>> (although that's not too likely since this patch shouldn't really help 
>> that either).
> 
> It's needed because the recesses are dark, it can't hurt, and it solved our
> problem! :)  Seriously, we believe that we could track down the root cause
> of our problem but it might take a lot of time and energy and the patch is
> doing the trick for us.  It's conceivable that other people have similar
> problems and the patch will help them too.  It was a challenge to catch the
> problem in the act, analyze the network trace, and come up with a
> solution.
> 
> And yes, it would be best if we did fix the root problem--it might be
> affecting other things.  At the moment it's not; the set of client machines
> having the problem is small and hard (for non-technical reasons) to patch,
> etc.
> 

Ah well. At least you're forewarned if similar strange things happen, 
which can only be a Good Thing(tm).

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
