Issues with NSCA

Ken Snider ken.snider at datawire.net
Fri Jan 17 18:30:41 CET 2003


Carroll, Jim P [Contractor] wrote:

> What's your service_reaper_frequency set to?  If it's set to 10, lower it to
> 5.  Restart nagios and see what happens.

I did so, and it did not seem to have any noticeable effect.
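For anyone following along, that knob lives in nagios.cfg. A minimal fragment, assuming the default install path:

```shell
# /usr/local/nagios/etc/nagios.cfg  (path is an assumption)
# Interval, in seconds, at which Nagios reaps queued service check results:
service_reaper_frequency=5
```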

I think, however, that I've found the problem.

I turned on debugging in nsca earlier to see if I could get more detail.
What I found is that nsca seems able to handle only a few connections at
startup, then blocks until those finish before trying to handle the others.

The problem is that send_nsca times out after 10 seconds, so the trailing
connections never actually get processed.
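For reference, nsca's verbose logging is a switch in nsca.cfg; a fragment, assuming the default path:

```shell
# /usr/local/nagios/etc/nsca.cfg  (path is an assumption)
# Log verbose connection-handling detail to syslog:
debug=1
```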

Here are all the timestamps. Look at the times the connections were "handled":

Jan 17 06:26:00 nms nsca[30165]: Handling the connection...
Jan 17 06:26:00 nms nsca[30166]: Handling the connection...
Jan 17 06:26:00 nms nsca[30167]: Handling the connection...
Jan 17 06:26:00 nms nsca[30168]: Handling the connection...
Jan 17 06:26:00 nms nsca[30169]: Handling the connection...
Jan 17 06:26:00 nms nsca[30170]: Handling the connection...
Jan 17 06:26:00 nms nsca[30171]: Handling the connection...
Jan 17 06:26:00 nms nsca[30172]: Handling the connection...
Jan 17 06:26:00 nms nsca[30173]: Handling the connection...
Jan 17 06:26:00 nms nsca[30174]: Handling the connection...
Jan 17 06:26:01 nms nsca[30175]: Handling the connection...
Jan 17 06:26:01 nms nsca[30176]: Handling the connection...
Jan 17 06:26:01 nms nsca[30177]: Handling the connection...
Jan 17 06:26:01 nms nsca[30178]: Handling the connection...
Jan 17 06:26:01 nms nsca[30179]: Handling the connection...
Jan 17 06:26:01 nms nsca[30180]: Handling the connection...
Jan 17 06:26:01 nms nsca[30181]: Handling the connection...
Jan 17 06:26:01 nms nsca[30182]: Handling the connection...
Jan 17 06:26:01 nms nsca[30183]: Handling the connection...
Jan 17 06:26:01 nms nsca[30184]: Handling the connection...
Jan 17 06:26:01 nms nsca[30185]: Handling the connection...
Jan 17 06:26:01 nms nsca[30186]: Handling the connection...
Jan 17 06:26:01 nms nsca[30187]: Handling the connection...
Jan 17 06:26:01 nms nsca[30188]: Handling the connection...
Jan 17 06:26:01 nms nsca[30189]: Handling the connection...
Jan 17 06:26:02 nms nsca[30197]: Handling the connection...
Jan 17 06:26:04 nms nsca[30207]: Handling the connection...
Jan 17 06:26:08 nms nsca[30227]: Handling the connection...
Jan 17 06:26:12 nms nsca[30256]: Handling the connection...
Jan 17 06:26:15 nms nsca[30295]: Handling the connection...
Jan 17 06:26:17 nms nsca[30299]: Handling the connection...
Jan 17 06:26:18 nms nsca[30317]: Handling the connection...
Jan 17 06:26:19 nms nsca[30320]: Handling the connection...
Jan 17 06:26:19 nms nsca[30321]: Handling the connection...
Jan 17 06:26:20 nms nsca[30325]: Handling the connection...
Jan 17 06:26:20 nms nsca[30326]: Handling the connection...
Jan 17 06:26:20 nms nsca[30327]: Handling the connection...
Jan 17 06:26:20 nms nsca[30328]: Handling the connection...
Jan 17 06:26:20 nms nsca[30329]: Handling the connection...
Jan 17 06:26:22 nms nsca[30346]: Handling the connection...

A ton are handled immediately, but then there's a sizable gap in which none
are. From what I can tell, the connections *are* made more or less within
1-2 seconds - our clocks are synchronized closely enough that most boxes send
their data in that window. That means that by :11 or :12 or so, send_nsca has
timed out, and those service checks will likely fail.

This theory is supported by the status of all the service checks run *after*
:10, as you'll see here:

Jan 17 06:26:02 nms nsca[30165]: End of connection or could not read request from client...
Jan 17 06:26:12 nms nsca[30227]: End of connection or could not read request from client...
Jan 17 06:26:13 nms nsca[30166]: End of connection or could not read request from client...
Jan 17 06:26:14 nms nsca[30167]: End of connection or could not read request from client...
Jan 17 06:26:14 nms nsca[30168]: End of connection or could not read request from client...
Jan 17 06:26:15 nms nsca[30169]: End of connection or could not read request from client...
Jan 17 06:26:15 nms nsca[30256]: End of connection or could not read request from client...
Jan 17 06:26:15 nms nsca[30170]: End of connection or could not read request from client...
Jan 17 06:26:16 nms nsca[30171]: End of connection or could not read request from client...
Jan 17 06:26:16 nms nsca[30172]: End of connection or could not read request from client...
Jan 17 06:26:17 nms nsca[30173]: End of connection or could not read request from client...
Jan 17 06:26:17 nms nsca[30174]: End of connection or could not read request from client...
Jan 17 06:26:17 nms nsca[30175]: End of connection or could not read request from client...
Jan 17 06:26:17 nms nsca[30176]: End of connection or could not read request from client...
Jan 17 06:26:17 nms nsca[30177]: End of connection or could not read request from client...
Jan 17 06:26:18 nms nsca[30178]: End of connection or could not read request from client...
Jan 17 06:26:18 nms nsca[30179]: End of connection or could not read request from client...
Jan 17 06:26:18 nms nsca[30180]: End of connection or could not read request from client...
Jan 17 06:26:18 nms nsca[30181]: End of connection or could not read request from client...
Jan 17 06:26:18 nms nsca[30182]: End of connection or could not read request from client...
Jan 17 06:26:18 nms nsca[30183]: End of connection or could not read request from client...
Jan 17 06:26:18 nms nsca[30184]: End of connection or could not read request from client...
Jan 17 06:26:19 nms nsca[30295]: End of connection or could not read request from client...
Jan 17 06:26:19 nms nsca[30185]: End of connection or could not read request from client...
Jan 17 06:26:19 nms nsca[30186]: End of connection or could not read request from client...
Jan 17 06:26:19 nms nsca[30187]: End of connection or could not read request from client...
Jan 17 06:26:19 nms nsca[30188]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30299]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30189]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30317]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30320]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30321]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30197]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30325]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30326]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30327]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30328]: End of connection or could not read request from client...
Jan 17 06:26:20 nms nsca[30329]: End of connection or could not read request from client...
Jan 17 06:26:21 nms nsca[30207]: End of connection or could not read request from client...
Jan 17 06:26:22 nms nsca[30346]: End of connection or could not read request from client...

Now, note that this message by itself indicates nothing; successful and
unsuccessful runs both end with the same message. BUT pick a PID spawned
*after* :10, find its termination time above, and compare it with a PID
spawned *before* :10. The early connections can take up to 20 seconds to
close, while the later ones close within a second or two (presumably because
the client send_nsca has already hung up).
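A quick way to see those lifetimes is to correlate each PID's "Handling" line with its "End of connection" line. This is a hypothetical helper, assuming the stock syslog field layout shown in the excerpts above; it runs against a tiny two-line sample built from the logs so the mechanics are visible:

```shell
# Build a tiny sample from the excerpts above (first and last events for PID 30165):
cat > /tmp/nsca-sample.log <<'EOF'
Jan 17 06:26:00 nms nsca[30165]: Handling the connection...
Jan 17 06:26:02 nms nsca[30165]: End of connection or could not read request from client...
EOF

# Field 3 is HH:MM:SS, field 5 is "nsca[PID]:"; pair start and end by PID.
lifetimes=$(awk '
  # convert HH:MM:SS to seconds since midnight
  function secs(t, a) { split(t, a, ":"); return a[1]*3600 + a[2]*60 + a[3] }
  /Handling the connection/            { start[$5] = secs($3) }
  /End of connection/ && ($5 in start) { printf "%s %ds\n", $5, secs($3) - start[$5] }
' /tmp/nsca-sample.log)
echo "$lifetimes"
```

Point it at the real syslog file instead of the sample to get the full table of per-connection lifetimes.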

Changing the send_nsca timeout to 60 seconds *completely* eliminated the
problem. However, that poses an even *more* confusing question: why did nsca,
running from xinetd, *not* handle the connections at once, as it should have?
As I said before, xinetd is set up to allow 200 concurrent connections, and it
is handling no other services, so the issue was not xinetd refusing the
connections, but NSCA not handling them.
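For anyone hitting the same thing, the client-side workaround is send_nsca's -to switch, which raises the timeout from its 10-second default. The host name, paths, and check result below are placeholders, and the actual submission is commented out since it needs a live nsca listener:

```shell
# One passive check result in send_nsca's expected stdin format:
#   host<TAB>service<TAB>return_code<TAB>plugin_output
payload="$(printf 'web01\tPING\t0\tPING OK')"

# The real submission (commented out; needs a reachable nsca server):
# printf '%s\n' "$payload" | /usr/local/nagios/bin/send_nsca -H nms -to 60 \
#     -c /usr/local/nagios/etc/send_nsca.cfg
printf '%s\n' "$payload"
```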

I think we're up against some sort of resource limitation in nagios, Linux,
or nsca(d) itself, since in theory xinetd should happily spawn up to 200 nsca
processes to handle the incoming load, yet it appears nsca won't actually
process that many.
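If anyone wants to check for that ceiling, here are a couple of places it could live; the paths are guesses for a typical Linux/xinetd setup:

```shell
# Per-process open file descriptor limit inherited by xinetd's children:
nofile=$(ulimit -n)
echo "open fds: $nofile"

# xinetd's own concurrency cap for the nsca service (path is a guess):
grep -i 'instances' /etc/xinetd.d/nsca 2>/dev/null \
    || echo "nsca xinetd config not found at that path"
```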

-- 
Ken Snider
Senior Systems Administrator
Datawire Communication Networks Inc.


