Dropped NSCA Packets? (WAS Re: Load issues with Nagios)

Ken Snider ksnider at datawire.net
Fri Jan 31 05:54:44 CET 2003


Carroll, Jim P [Contractor] wrote:
> Ah yes, so you did.  I think my brain was stuck on "the box is *always* at
> 100% CPU".
> 
> Have you tried truss/strace/whatever is appropriate for the o/s you use?

Interestingly, rebooting the box (and applying an uprev kernel) eradicated 
the issue, though, as is always the case with multiple variables, I am now 
unsure *which* of these things caused the "fix". I'll revisit this again 
should the symptoms recur.

And, for the sake of completeness, this was with 1.0.

Another interesting issue, however.

I've written a small wrapper that allows me to execute arbitrary plugins on 
a remote host and "massage" the data (essentially prepend the hostname and 
integer return code to the plugin output). This is combined with the output 
of any other plugins running (newline between each) and piped to send_nsca.

This works wonderfully. On most of our systems, 5 plugins report every two 
minutes.
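
For the sake of clarity, the wrapper boils down to something like this 
(the plugin list, paths and the central host name here are placeholders 
rather than our real configuration):

#!/bin/sh
# Rough sketch only; the real script has more plugins and error handling.
# Host name, paths and the plugin list below are placeholders.
NAGIOS_HOST=nagios.example.com
THIS_HOST=`hostname`

run_check () {
    # $1 = service description, remaining args = plugin command line
    svc="$1"; shift
    output=`"$@"`
    rc=$?
    # send_nsca expects: host<TAB>service<TAB>return code<TAB>plugin output
    printf "%s\t%s\t%d\t%s\n" "$THIS_HOST" "$svc" "$rc" "$output"
}

{
    run_check "Disk" /usr/local/nagios/libexec/check_disk -w 10% -c 5% -p /
    run_check "Load" /usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6
    run_check "Swap" /usr/local/nagios/libexec/check_swap -w 20% -c 10%
} | /usr/local/nagios/bin/send_nsca -H "$NAGIOS_HOST" -to 30 \
    -c /usr/local/nagios/etc/send_nsca.cfg

All of the results from one run go out over a single send_nsca connection, 
for what it's worth.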

We have our "freshness" checking set to 8 minutes, or four iterations 
without a response from a given passive check (in reality, it is less than 
that because of check/processing latencies, but it should *more* than 
suffice). Even with the freshness threshold set so high, I do notice 
services occasionally entering soft "unknown" states (the result of a 
script that runs when our services fail their freshness check).
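
If it helps, the freshness side is configured roughly as follows (names 
here are placeholders; if I understand the freshness logic correctly, the 
stale-handler script is simply the service's check_command, which Nagios 
runs actively once a passive result goes stale):

# nagios.cfg
check_service_freshness=1

# per-service definition (template format; one of these per passive check)
define service{
        host_name               somebox
        service_description     Disk
        active_checks_enabled   0
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     480     ; 8 minutes, in seconds
        check_command           report-stale-result
        ...
        }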

Now, I see two possibilities here. First, congestion. Since we use NTP to 
sync our boxes, they *do* literally hammer the central box within a second 
or two of each other. However, I have nsca spawned through xinetd, and the 
box seems to take the connections without issue. I also have send_nsca 
itself set to time out at 30 seconds, which is more than enough time to 
process the results, as running nsca(d) in debug mode shows all results 
processed in 8 seconds or so. So this possibility seems somewhat unlikely.
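
(For reference, the nsca xinetd entry is nothing exotic -- roughly the 
usual setup, with illustrative paths; nsca's port also needs its line in 
/etc/services:)

# /etc/xinetd.d/nsca
service nsca
{
        flags           = REUSE
        socket_type     = stream
        wait            = no
        user            = nagios
        server          = /usr/local/nagios/bin/nsca
        server_args     = -c /usr/local/nagios/etc/nsca.cfg --inetd
        log_on_failure  += USERID
        disable         = no
}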

The second possibility is that we're hitting some sort of limit in Nagios 
itself. Our command_check_interval is set to -1, while our reaper frequency 
is 5 seconds, so I don't think it's a pipe-related issue (there are, 
perhaps, 50 servers that check in nearly simultaneously with about 1K of 
plugin data).
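
In nagios.cfg terms, that means roughly the following (the command_file 
path shown is just the stock default, not necessarily ours):

# nagios.cfg
check_external_commands=1
# check the external command file as often as possible
command_check_interval=-1
# reap service/passive check results every 5 seconds
service_reaper_frequency=5
command_file=/usr/local/nagios/var/rw/nagios.cmd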

My question is twofold. First, has anyone else experienced this? And 
second, does anyone understand the inner workings of send_nsca sufficiently 
to explain to me how it deals with spurious network latency, packet loss, 
or blocking issues? *Should* it ever drop a connection other than when it 
reaches the 30-second timeout I've set?

-- 
Ken Snider
Senior Systems Administrator
Datawire Communication Networks Inc.


