Small patch for check_nrpe.c

Andreas Ericsson ae at op5.se
Fri Sep 1 10:15:18 CEST 2006


Mark Plaksin wrote:
> Here's a small patch which makes check_nrpe close the socket gracefully
> when it's done.  This resolved a problem we were having with spurious
> timeouts.  We've been running it on our production Nagios instance (200
> hosts, 5000 services; most services use NRPE) for a week and it's working
> great.
> 
> Before the patch, check_nrpe_ssl was timing out when trying to connect to
> hosts that were definitely up.  A local expert (Jay Cotton) looked at our
> sniffer trace, explained the problem, and offered a fix.  The server end
> was "ungracefully" closing the socket connected to the client.  For some
> reason (NAT device in the middle, TCP stack on the HP-UX 11.11 client,
> or?), the client thought the connection was still open.  The client
> continues saying FIN after the server has sent RST.  The client keeps the
> connection in the LAST_ACK state for several minutes.
> 

120 seconds, I guess, since that's the max TCP timeout. For the 
archives, this is configurable on Linux via the 
net.ipv4.tcp_fin_timeout sysctl. A setting of 30 works without problems 
and prevents long-lasting FIN sessions.
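
For the archives, a minimal sketch of checking the current value from C. 
It assumes a Linux box with procfs mounted, where the sysctl is exposed 
as /proc/sys/net/ipv4/tcp_fin_timeout. The helper name read_int_file is 
made up for this example and is not part of nrpe.

```c
#include <assert.h>
#include <stdio.h>

/* Read a single integer from a procfs-style file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 * (Hypothetical helper, for illustration only.) */
int read_int_file(const char *path, int *value)
{
    FILE *fp;
    int ok;

    assert(path != NULL && value != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%d", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

Calling read_int_file("/proc/sys/net/ipv4/tcp_fin_timeout", &secs) 
should then give the timeout in seconds. Changing the value needs root.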

> That's not so bad in itself but we were unlucky enough to have our server
> (Debian stable box running a 2.6 kernel) attempt a new connection to the
> same client using the same source port!   The client thought it was already
> talking to the server on that port so it didn't play along and
> check_nrpe_ssl on the server timed out.
> 

I'm a bit confused about your terms here. Is the "client" the host 
running Nagios? It would seem so above, since you say the server end 
ungracefully closes the connection. Closing the socket is done by 
check_nrpe after it has received all the data from nrpe, so if it goes 
the other way around you've got something funny going on indeed.

Otoh, if the "client" really is the host running Nagios, your second 
statement doesn't make much sense, as check_nrpe initiates the 
connection too.

Either way, check_nrpe will always connect to the same port, and no port 
can ever be used twice for connecting to somewhere else (check 
linux/net/ipv4/inet_connection_sock.c, the function inet_csk_get_port).

The only way the same port could have been used twice is if it was 
specifically bound to one port using the bind() system call on a socket 
that has been set to reuse ports with the SO_REUSEADDR option to 
setsockopt(). The port being bound to must be in FIN_WAIT to be eligible 
for reusing. If that's what's happening (it's not in check_nrpe), you've 
found either a bug in the kernel or a program whose author should be 
either shot or educated, depending on preference.
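
To make that concrete, here is a rough sketch of what such a program 
would have to do before calling connect(). The function name 
bind_source_port is hypothetical, and nothing like this exists in 
check_nrpe, which lets the kernel pick the source port.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Pin the local (source) port of an unconnected TCP socket.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 * (Hypothetical helper, for illustration only.) */
int bind_source_port(int sd, uint16_t port)
{
    struct sockaddr_in local;
    int on = 1;

    assert(sd >= 0);

    /* Ask the kernel to allow reuse of a local address that may still
     * be tied up by an old connection. */
    if (setsockopt(sd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0)
        return -1;

    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_addr.s_addr = htonl(INADDR_ANY);
    local.sin_port = htons(port);   /* 0 lets the kernel choose a port */

    return bind(sd, (struct sockaddr *)&local, sizeof(local));
}
```

A program that calls this with a fixed non-zero port and then connect() 
can end up reusing the same source port for successive connections to 
the same peer.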

> Closing the connection gracefully eliminated the problem.  Below is Jay's
> note describing his fix.
> 
> Thanks!
> 
> ------------------------------------------------------------------------------
> Find the line that reads "close(sd)" a few lines after the line that reads
> "/* close the connection */". BTW, you'll notice the close() command is
> listed in the source code a couple of times below this. Technically those
> shouldn't be there since the connection will already be closed...a small
> programming bug, but one that isn't going to affect us.
> 
> Although using close() can work, it usually results in a RST being sent
> because the program exits before reading all data or getting the FIN from
> the remote. For a graceful close you need to wait until receiving the FIN
> from the remote before issuing the close() command. To do this requires
> cooperation from the remote, but in most cases isn't a problem (sending the
> FIN will cause the other end of the connection to close).
> 
> Here's what you're supposed to do:
> 
> 1. use the shutdown() command to send a FIN: shutdown(sd, SHUT_WR)
> 2. use select() and recv() to process incoming data from remote (actual
> data can be ignored). When the remote closes, the recv() command will
> return 0, indicating a graceful close. The select() command is needed to
> make sure recv() doesn't block indefinitely...allowing you to put an upper
> limit on how long to wait. After all, the remote may decide not to close
> the connection gracefully.
> 3. Finally, call close() and continue processing normally. At this point,
> both ends of the connection are closed properly and calling close() merely
> releases the resources we allocated for that socket.
> 
> Here's a function you can add to the code that accomplishes this task:
> 
> void graceful_close(int sd, int timeout)
> {
>         fd_set in;
>         struct timeval tv;
>         char buf[1000];
> 
>         shutdown(sd, SHUT_WR);  // Send FIN packet
>         for ( ; ; ) {
>                 FD_ZERO(&in);
>                 FD_SET(sd, &in);
>                 tv.tv_sec = timeout / 1000;
>                 tv.tv_usec = (timeout % 1000) * 1000;
>                 // timeout or error
>                 if (1 != select(sd + 1, &in, NULL, NULL, &tv)) break;
>                 // no more data (FIN or RST)
>                 if (0 >= recv(sd, buf, sizeof(buf), 0)) break;
>         }
>         closesocket(sd);
> }
> 
> Instead of calling close(sd) we'll call graceful_close(sd, 5000) to wait up
> to 5 seconds (5000 milliseconds) for the remote to close before aborting
> the connection. This should fix the problem...I think. :)
>
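
For anyone who wants to try the shutdown()/select()/recv() sequence 
outside of check_nrpe, below is a small self-contained variant. It is 
only a sketch: it assumes a POSIX system (so it uses close() rather 
than closesocket()), and the name graceful_close_status and its return 
value are my additions so the outcome can be checked.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Variant of graceful_close() that reports how the wait ended:
 *   0  the peer closed cleanly (recv() returned 0)
 *  -1  select() timed out or failed, or recv() failed (e.g. RST)
 * The socket is closed in both cases. */
int graceful_close_status(int sd, int timeout_ms)
{
    fd_set in;
    struct timeval tv;
    char buf[1000];
    ssize_t n;
    int status = -1;

    assert(timeout_ms >= 0);

    shutdown(sd, SHUT_WR);              /* send our FIN */
    for (;;) {
        FD_ZERO(&in);
        FD_SET(sd, &in);
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        if (select(sd + 1, &in, NULL, NULL, &tv) != 1)
            break;                      /* timeout or error */
        n = recv(sd, buf, sizeof(buf), 0);
        if (n == 0) {
            status = 0;                 /* peer sent its FIN */
            break;
        }
        if (n < 0)
            break;                      /* error, e.g. connection reset */
        /* n > 0: discard leftover data and keep waiting */
    }
    close(sd);                          /* release the descriptor */
    return status;
}
```

Note that the timeout applies to each select() call, so a peer that 
keeps trickling data can stretch the total wait beyond timeout_ms.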

This patch doesn't hurt anything, although I'm still curious as to why 
it's needed, since it seems like there's some horribly odd bug somewhere 
in the dark recesses of one system's kernel, or your network setup 
(although that's not too likely since this patch shouldn't really help 
that either).

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
