Small patch for check_nrpe.c

Mark Plaksin happy at usg.edu
Fri Sep 1 14:19:08 CEST 2006


Andreas Ericsson <ae at op5.se> writes:

> Mark Plaksin wrote:
>
>> Here's a small patch which makes check_nrpe close the socket gracefully
>> when it's done.  This resolved a problem we were having with spurious
>> timeouts.  We've been running it on our production Nagios instance (200
>> hosts, 5000 services; most services use NRPE) for a week and it's working
>> great.
>> 
>> Before the patch, check_nrpe_ssl was timing out when trying to connect to
>> hosts that were definitely up.  A local expert (Jay Cotton) looked at our
>> sniffer trace, explained the problem, and offered a fix.  The server end
>> was "ungracefully" closing the socket connected to the client.  For some
>> reason (NAT device in the middle, TCP stack on the HP-UX 11.11 client,
>> or?), the client thought the connection was still open.  The client
>> continues saying FIN after the server has sent RST.  The client keeps the
>> connection in the LAST_ACK state for several minutes.
>> 
>
> 120 seconds, I guess, since that's the max tcp timeout. For the 
> archives, this is configurable via a sysctl to net.ipv4.tcp_fin_timeout 
> on Linux. A setting of 30 works without problems and prevents 
> long-lasting FIN-sessions.

Sorry if I wasn't clear about which end was which.  "Server" means box
running Nagios and client is the ... client to which Nagios connects.  On
our client, connections remained in the LAST_ACK state for over 3 minutes.
There is probably an ndd parameter which could be tuned but I have not
found it yet.

>> That's not so bad in itself but we were unlucky enough to have our server
>> (Debian stable box running a 2.6 kernel) attempt a new connection to the
>> same client using the same source port!  The client thought it was already
>> talking to the server on that port so it didn't play along and
>> check_nrpe_ssl on the server timed out.
>> 
>
> I'm a bit confused about your terms here. Is the "client" the host 
> running Nagios? It would seem so above, since you say the server end 
> ungracefully closes the connection. Closing the socket is done by 
> check_nrpe after it has received all the data from nrpe, so if it goes 
> the other way around you've got something funny going on indeed.
>
> Otoh, if the "client" really is the host running Nagios, your second 
> statement doesn't make much sense, as check_nrpe initiates the 
> connection too.
>
> Either way, check_nrpe will always connect to the same port, and no port 
> can ever be used twice for connecting to somewhere else (check 
> linux/net/ipv4/inet_connection_sock.c, the function inet_csk_get_port).

Maybe this description will make more sense:

1)  Nagios box (Debian) makes a connection to the client with a randomly
    selected source port of, say, 30000 and a destination port of 5666
    (the NRPE port) on the client (HP-UX).
2)  A normal conversation occurs and the Nagios end sends RST (a result of
    close(sd); see the sketch after this list) to the client.
3)  Nagios box believes the connection is over and moves on.  *Something*
    makes the client box believe the connection is still alive and it
    continues to believe this for several minutes.  'netstat -a' shows the
    connection in the LAST_ACK state all this time.  

    Jay's best guess is that the NAT device caused the trouble--it saw
    enough to think the conversation was over and stopped forwarding
    packets.  And the HP-UX box (client) wanted to see more before
    deciding the connection was over.
4)  Nagios box makes a brand new connection to the same client machine and
    by great coincidence uses the same source port--30000.
5)  The client box believes the two machines are already talking on this
    port and doesn't allow the new connection to succeed.  The Nagios box
    eventually gives up and check_nrpe says "timeout".
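
For the archives, one likely mechanism behind step 2: on Linux, close()
tears a connection down with RST instead of FIN when unread data is still
sitting in the socket's receive queue.  A graceful close therefore means
half-closing the connection and draining whatever the peer still had in
flight before calling close().  Here is a minimal sketch of the idea (not
the literal patch; 'sd' stands for the connected socket descriptor from
step 2):

    #include <sys/socket.h>
    #include <unistd.h>

    static void graceful_close(int sd)
    {
        char buf[1024];

        shutdown(sd, SHUT_WR);   /* send our FIN; keep the read side open */

        /* Read until the peer's FIN (recv() returns 0) or an error, so
         * nothing is left unread in the receive queue when close() runs
         * and the kernel finishes with the normal FIN handshake instead
         * of RST. */
        while (recv(sd, buf, sizeof(buf), 0) > 0)
            ;

        close(sd);
    }

SO_LINGER tricks are sometimes suggested for this, but the shutdown-and-
drain sequence above is the usual portable way to make sure the FIN/ACK
exchange completes on both ends.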

> This patch doesn't hurt anything, although I'm still curious as to why 
> it's needed, since it seems like there's some horribly odd bug somewhere 
> in the dark recesses of one system's kernel, or your network setup 
> (although that's not too likely since this patch shouldn't really help 
> that either).

It's needed because the recesses are dark, it can't hurt, and it solved our
problem! :)  Seriously, we believe that we could track down the root cause
of our problem but it might take a lot of time and energy and the patch is
doing the trick for us.  It's conceivable that other people have similar
problems and the patch will help them too.  It was a challenge to catch the
problem in the act, analyze the network trace, and come up with a
solution.

And yes, it would be best if we fixed the root problem--it might be
affecting other things.  At the moment it's not; the set of client machines
having the problem is small and hard (for non-technical reasons) to patch,
etc.


