Performance tuning a host returning results via NSCA

Frost, Mark {PBG} mark.frost1 at pepsi.com
Fri Mar 7 17:26:18 CET 2008


 

>-----Original Message-----
>From: nagios-users-bounces at lists.sourceforge.net 
>[mailto:nagios-users-bounces at lists.sourceforge.net] On Behalf 
>Of Oliver Hookins
>Sent: Thursday, March 06, 2008 11:57 PM
>To: nagios-users at lists.sourceforge.net
>Subject: Re: [Nagios-users] Performance tuning a host 
>returning results via NSCA
>
>On Thu Mar 06, 2008 at 22:39:41 -0600, Marc Powell wrote:
>>
>>On Mar 6, 2008, at 8:17 PM, Oliver Hookins wrote:
>>
>>> Hi all,
>>>
>>> I guess performance is a constant problem for everyone but 
>what I'm  
>>> seeing
>>> doesn't seem to make sense. I have two servers running Nagios, one  
>>> that is
>>> more or less just a frontend and another doing the checks and  
>>> returning
>>> results via send_nsca. Constantly I see the frontend light up with  
>>> criticals
>>> due to the passive results not being received in time (the service
>>> freshness timeout is 120 seconds).
>>
>>How many and are they always the same? What version of nagios?
>
>The host doing the checks is 2.10 and the frontend is 2.6. There are
>anything from just one critical service to dozens. The freshness timer
>expires then I have a dummy active check which always returns 
>critical and
>mentions something about the freshness timer expiring.
>
>The actual services that return critical in these cases are always
>different.
>
>>>
>>>
>>> I only have 120 passive service checks and 45 passive host checks,  
>>> so I can
>>> assume if none of the hosts are down it is only doing one 
>check per  
>>> second.
>>
>>Nagios doesn't distribute them that evenly but it tries to. I assume  
>>that you haven't done anything to prevent the parallelization of  
>>service checks. Is your normal_check_interval exactly 120 
>seconds? If  
>>so, you'll have some checks that happen at that time and 
>their results  
>>would be received by the central host after your freshness timeout.  
>>Also, do you have a lot of host volatility?
>
>normal_check_interval is 60 seconds for all service checks. 
>Host volatility
>is pretty low, if any. In fact most of the hosts this system 
>monitors are
>very stable.
>
>>I'd also check communication between the remote and central. Try  
>>sending passive results manually from the command line to make sure  
>>they complete in a timely manner (should be fractions of a second).
>
>It's a WAN link with fairly high latency and low bandwidth, 
>but according to
>iptraf the amount of data being transferred is low anyway. I 
>guess this is
>what I was driving at in my original post - does Nagios only ever call
>send_nsca serially? If the service checks are done in parallel 
>I would have
>thought send_nsca would be called in parallel as well.
>
>-- 
>Regards,
>Oliver Hookins
>Anchor Systems

I had posted a message along these lines back in January
(title "Problem with high latencies after going distributed").
Ultimately, we found that the time spent running the OCSP and OCHP
commands was sending check latencies through the roof on the
distributed node.  Every time Nagios needed to run the command that did
the send_nsca, that command could take 10 seconds or so, and that kept
pushing the scheduled service and host check times back further and
further.  Eventually I think we had latencies of around 9000 seconds on
the distributed nodes.  When we ran the same set of services on a
non-distributed system, I think latencies were about a second.

And in our case as well, the central server saw timeouts due to
freshness checking, and almost every check ended up being run actively
from the central server.

The general solution is to have the ocsp/ochp commands return as
quickly as possible -- almost instantly if you can manage it.  There
are already some solutions out there that do this.  A couple I saw
involved sending check data through a pipe (fifo) and having a separate
daemon process at the other end of the pipe do the actual processing
of the check data (the send_nsca call).  Nagios is then fooled into
thinking that the ocsp/ochp command did its job really quickly.
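A minimal sketch of the consumer side of that fifo approach (the paths,
hostnames, and field order below are my assumptions for illustration,
not anything from the thread): the ocsp command writes one formatted
line into a named pipe and exits immediately, while this daemon sits on
the other end and pays the send_nsca cost:

```python
#!/usr/bin/env python3
# Consumer side of the fifo approach. Nagios's ocsp_command just
# echoes one formatted line into the named pipe and returns at once;
# this daemon blocks on the pipe and does the slow send_nsca call.
# All paths and hostnames below are hypothetical.
import subprocess

FIFO = "/var/nagios/rw/ocsp.fifo"
SEND_NSCA = ["send_nsca", "-H", "central.example.com",
             "-c", "/etc/nagios/send_nsca.cfg"]

def to_nsca_line(host, service, code, output):
    """Format one passive service result the way send_nsca expects:
    tab-separated host, service description, return code, output."""
    return f"{host}\t{service}\t{code}\t{output}\n"

def consume(fifo_path=FIFO):
    # Opening a fifo for reading blocks until a writer shows up, so
    # the loop just reopens it after each burst of results from Nagios.
    while True:
        with open(fifo_path) as fifo:
            for line in fifo:
                # The WAN round trip happens here, in a separate
                # process, so Nagios's check latency is unaffected.
                subprocess.run(SEND_NSCA, input=line, text=True)

# consume() would be the long-running entry point of the daemon.
```

The matching ocsp command on the Nagios side is then just a one-line
append to the pipe, which is why Nagios sees it finish instantly.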

Ultimately, I did not like the fifo solution because I would have to
keep a second process (the fifo consumer) running properly, otherwise
the fifo fills up and Nagios locks up.  I ended up "rolling my own"
solution where Nagios writes check data to a plain file and a
daemon-ish perl script grabs the data every 10 seconds and pushes it
up as a batch to the central server.  If the batch script dies for
some reason, I have a pretty big buffer of time in which I can simply
remove the ocsp/ochp data file, and Nagios is never blocked on the
distributed node.  I now have sub-second latencies on my distributed
nodes.
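The file-and-batch variant looks roughly like this (a sketch only; my
real version is perl, and the file locations, interval, and send_nsca
arguments here are illustrative):

```python
#!/usr/bin/env python3
# Batching sender for the "roll your own" approach described above:
# Nagios's ocsp/ochp command appends one line per result to a spool
# file (a cheap, near-instant operation), and this script wakes every
# 10 seconds, rotates the file, and ships the whole batch through a
# single send_nsca run. Paths and hostnames are illustrative.
import os
import subprocess
import time

SPOOL = "/var/nagios/rw/ocsp.spool"
SEND_NSCA = ["send_nsca", "-H", "central.example.com",
             "-c", "/etc/nagios/send_nsca.cfg"]

def drain_spool(spool=SPOOL):
    """Atomically take ownership of the current batch.

    rename() is atomic on the same filesystem, so Nagios can keep
    appending to a fresh spool file while we read the old one."""
    if not os.path.exists(spool):
        return []
    batch = spool + ".sending"
    os.rename(spool, batch)
    with open(batch) as f:
        lines = f.readlines()
    os.remove(batch)
    return lines

def run(interval=10):
    while True:
        lines = drain_spool()
        if lines:
            # One send_nsca invocation per batch instead of one per
            # check result -- the WAN latency is paid once per interval.
            subprocess.run(SEND_NSCA, input="".join(lines), text=True)
        time.sleep(interval)

# run() would be the long-running entry point of the batch script.
```

If this script dies, results simply pile up in the spool file until it
is restarted (or the file is removed), and Nagios itself never blocks.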

I'm actually surprised that even as of Nagios 3.x this is still a
problem (speedily sending distributed check data back to the central
server), forcing people to come up with their own unofficial
solutions.  The standard solution the Nagios books mention -- a simple
shell script that runs send_nsca with the right arguments -- causes
exactly this problem if you're running any reasonable number of checks
from your distributed node.
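For comparison, that book-style wrapper amounts to something like this
(sketched in Python rather than shell; the hostname and config path are
made up).  Nagios runs it synchronously for every single result, so
each check blocks on a full round trip to the central server:

```python
#!/usr/bin/env python3
# The "standard" per-result submitter from the books, sketched in
# Python (the usual version is a short shell script). Nagios invokes
# this once per completed check as its ocsp/ochp command, so every
# result pays the whole send_nsca round trip -- the latency problem
# described above. Hostname and config path are illustrative.
import subprocess

def submit_one(host, service, code, output,
               central="central.example.com",
               cfg="/etc/nagios/send_nsca.cfg"):
    """Ship exactly one passive result; blocks for the round trip."""
    line = f"{host}\t{service}\t{code}\t{output}\n"
    result = subprocess.run(["send_nsca", "-H", central, "-c", cfg],
                            input=line, text=True)
    return result.returncode
```

With a high-latency WAN link and 120-odd services on a 60-second
interval, submissions queue up faster than they drain, which matches
the runaway latencies described earlier in this thread.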

Mark

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
