Small patch for check_nrpe.c

Mark Plaksin happy at usg.edu
Fri Sep 1 14:54:55 CEST 2006


Andreas Ericsson <ae at op5.se> writes:

> Mark Plaksin wrote:
>
>> Andreas Ericsson <ae at op5.se> writes:
>> 
>>> Mark Plaksin wrote:
>>>
>> 
>> Sorry if I wasn't clear about which end was which.  "Server" means the box
>> running Nagios and "client" is the ... client to which Nagios connects.  On
>> our client, connections remained in the LAST_ACK state for over 3 minutes.
>> There is probably an ndd parameter which could be tuned but I have not
>> found it yet.
>> 
>>>> That's not so bad in itself but we were unlucky enough to have our server
>>>> (Debian stable box running a 2.6 kernel) attempt a new connection to the
>>>> same client using the same source port!   The client thought it was already
>>>> talking to the server on that port so it didn't play along and
>>>> check_nrpe_ssl on the server timed out.
>>>>
>>> I'm a bit confused about your terms here. Is the "client" the host 
>>> running Nagios? It would seem so above, since you say the server end 
>>> ungracefully closes the connection. Closing the socket is done by 
>>> check_nrpe after it has received all the data from nrpe, so if it goes 
>>> the other way around you've got something funny going on indeed.
>>>
>>> Otoh, if the "client" really is the host running Nagios, your second 
>>> statement doesn't make much sense, as check_nrpe initiates the 
>>> connection too.
>>>
>>> Either way, check_nrpe will always connect to the same port, and no port 
>>> can ever be used twice for connecting to somewhere else (check 
>>> linux/net/ipv4/inet_connection_sock.c, the function inet_csk_get_port).
>> 
>> Maybe this description will make more sense:
>> 
>> 1)  Nagios box (Debian) makes a connection to client with a randomly selected source
>>     port of, say, 30000 and a destination port of 5666 (the NRPE port) on
>>     the client (HP-UX).
>> 2)  A normal conversation occurs and the Nagios end sends RST (a result of
>>     close(sd)) to the client.
>> 3)  Nagios box believes the connection is over and moves on.  *Something*
>>     makes the client box believe the connection is still alive and it
>>     continues to believe this for several minutes.  'netstat -a' shows the
>>     connection in the LAST_ACK state all this time.  
>> 
>>     Jay's best guess is that the NAT device caused the trouble--it saw
>>     enough to think the conversation was over and stopped sending packets.
>
> It's possible, but it would have had to fail to forward the RST from 
> Nagios -> other_host in order for the other machine to think the 
> connection was still up. If, however, the RST packet got lost (busy 
> line, perhaps?), then you'd get this exact situation.
>
> Did you do the packet-trace between NAT -> HPUX and Nagios -> NAT at the 
> same time, or only on one side?

We actually mirrored all the ports involved (Nagios server plus 6 clients)
onto a single switch port and ran the trace on the mirrored port.  That
made it a bit hard to read in Ethereal's (uh, I mean Wireshark's!) GUI.
Wireshark thought there were lots of retransmissions and the like because
it was seeing the same packet as it went out the server's port and into the
client's port.  Had I been thinking, I might have run two traces--one of the
server's port and one of the clients' ports.

But Jay was able to make sense of the trace and come up with a theory and a
solution :)  So we didn't redo the trace.  I still have it and could send
the snippet that shows the problem.  I don't remember the exact details
(like whether the RST is what got dropped).
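
Speaking of the RST: I don't actually know why the close(sd) shows up on the
wire as an RST rather than a normal FIN.  Two common ways a plain close()
turns into an RST are data still sitting unread in the receive buffer at
close time, and an abortive close via SO_LINGER with a zero timeout.  Just to
illustrate the second case -- this is not from check_nrpe.c, and
abortive_close is a made-up name:

/* Illustrative only.  SO_LINGER with a zero timeout makes close()
 * abort the connection with an RST instead of the usual FIN handshake. */
#include <sys/socket.h>
#include <unistd.h>

static void abortive_close(int sd)
{
    struct linger lng;

    lng.l_onoff  = 1;   /* linger on close ...                    */
    lng.l_linger = 0;   /* ... for zero seconds => RST on close() */
    setsockopt(sd, SOL_SOCKET, SO_LINGER, &lng, sizeof(lng));
    close(sd);
}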

>>     And the HP-UX box (client) wanted to see more before deciding the
>>     connection was over.
>> 4)  Nagios box makes a brand new connection to the same client machine and
>>     by great coincidence uses the same source port--30000.
>> 5)  The client box believes the two machines are already talking on this
>>     port and doesn't allow the new connection to succeed.  The Nagios box
>>     eventually gives up and check_nrpe says "timeout".
>> 
>
> Made it crystal clear :)
>
> The really odd thing is that as per the TCP rfc (793, I believe), the max 
> timeout for a TCP connection is 120 seconds, which means that either the 
> NAT device in between kept the connection alive for some seriously odd 
> reason, or the HP-UX kernel is bugged/non-rfc-compliant.
>
> Btw, Linux uses ports ~45000 up to ~65000 in a round-robin manner, so on 
> a system with 20000 outbound connection attempts in the interval you 
> have between each check towards the failing system, you'll end up in the 
> rough neighbourhood of the same port-number. Some checks initiate more 
> than one connection, so for a busy Nagios server this isn't an unlikely 
> scenario.

Jay said that 2.4 kernels use round-robin and 2.6 kernels select ports
randomly (from a given range).  I couldn't find this explicitly stated
anywhere, but I probably wasn't searching for (or reading) the right thing.
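
For what it's worth, it's easy to see which range a given Linux box hands
out, even if that doesn't tell you how ports are picked within it.  Here's a
trivial sketch that just reads /proc/sys/net/ipv4/ip_local_port_range (the
same thing cat'ing that file would show):

/* Prints the ephemeral port range the kernel picks from when connect()
 * is called on an unbound socket -- i.e. the source port check_nrpe
 * ends up with.  Shows only the range, not how ports are selected
 * within it. */
#include <stdio.h>

int main(void)
{
    int lo, hi;
    FILE *fp = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");

    if (!fp) {
        perror("ip_local_port_range");
        return 1;
    }
    if (fscanf(fp, "%d %d", &lo, &hi) == 2)
        printf("local port range: %d - %d\n", lo, hi);
    fclose(fp);
    return 0;
}

That at least tells you how big the pool is that a busy Nagios server is
cycling through.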

