Problem with high latencies after going distributed

Steve Shipway s.shipway at auckland.ac.nz
Wed Jan 23 04:05:14 CET 2008


> >> 	Active Service Latency:               0.000 / 7267.198 /
...
> >The only possible cause is the OCSP command slowing things
> >down somehow.
...
> But if the submit_check_result is running slowly, that would only
> affect the service execution time wouldn't it?  My understanding of
> check latency is that it's the difference in time between when Nagios
> schedules a check to run versus the time that the check actually
> starts to execute.

If the scheduler gets behind, then the latency increases, because it
runs the service checks in the order they were scheduled.  It is
possible that the OCSP handler is run SERIALLY with service checks (as
the host checks are done in 1.x) and is therefore holding up service
checks, just like you'd see if you had a lot of down hosts and a
long-running host check command.
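
To illustrate why a serially-run handler would show up as latency rather than execution time, here is a toy model in Python.  It is only a sketch under my own assumptions, not how the Nagios scheduler is actually written: the function name, the fixed check interval and the single blocking handler per check are all made up for the example.

```python
def simulate_latency(num_checks, interval, handler_time):
    """Return the latency of each check when a blocking handler follows every check.

    Toy model (assumption): checks are scheduled every `interval` seconds,
    and the scheduler cannot start the next check until the handler for the
    previous one has finished after `handler_time` seconds.
    """
    clock = 0.0          # time at which the scheduler is next free
    latencies = []
    for i in range(num_checks):
        scheduled = i * interval
        start = max(clock, scheduled)       # cannot start before the scheduler is free
        latencies.append(start - scheduled)  # latency = actual start - scheduled start
        clock = start + handler_time         # the handler blocks the scheduler
    return latencies


# Handler faster than the check interval: the scheduler keeps up.
fast = simulate_latency(4, 1.0, 0.25)
# Handler slower than the check interval: every check starts later than the last.
slow = simulate_latency(4, 1.0, 1.5)
```

In this model the latency stays at zero while the handler is quicker than the interval between checks, and grows without bound once it is slower, even though the checks themselves take no time at all.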

> But maybe I'm misunderstanding something here.  When it comes to
> working with Nagios, I tend to learn the most when I have the biggest
> problems


Don't we all :-/.  The latency effect of non-parallel host checks was a
nasty surprise to me.

> Do you do the same thing I mentioned where you define all the checks
> on both distributed nodes, but disable checks on complementary halves
> of those checks on each node?

Yes.  However, I can't always set the freshness checking because some of
our checks run only every 4 hours, although most are at a sub-15-minute
interval.
We have a complex configuration tool that builds our whole distributed
Nagios/MRTG configuration set from templates so I can't hand-hack the
config files either.

I have now set up one of our distributed nodes to batch the NSCA
messages, and will see if the latency increases overnight (so far, it
looks good).  To do this, I just changed submit_check_result to only
append to a file, then added a Nagios every-minute cronjob to cat the
contents of this file into send_nsca (actually, there are a few more
steps to ensure data integrity and checks, but that's basically it).
The upshot is that some checks may be delayed by up to a minute, and
we're dependent on cron, but the OCSP command exits very fast.
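
The batching idea can be sketched roughly as follows.  This is not the pair of scripts mentioned in this message; it is a minimal Python illustration, and the function names, the spool file path and the ".sending" suffix are all hypothetical.  The only thing taken from NSCA itself is that send_nsca reads tab-delimited lines (host, service, return code, plugin output) on standard input.  The extra integrity steps mentioned above are not reproduced here.

```python
import os
import subprocess


def queue_result(spool_path, host, service, return_code, output):
    """Append one passive result to the spool file and return immediately.

    This is the part the OCSP command would run, so it does no network I/O.
    """
    # Keep the record on a single line so the tab-delimited format stays intact.
    output = output.replace("\n", " ").replace("\t", " ")
    line = f"{host}\t{service}\t{return_code}\t{output}\n".encode()
    fd = os.open(spool_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, line)  # one small append per result
    finally:
        os.close(fd)


def flush_spool(spool_path, send_cmd):
    """Pipe everything queued so far into one send_cmd run; return lines sent.

    This is the part the every-minute cron job would run.
    """
    batch_path = spool_path + ".sending"  # hypothetical name for the in-flight batch
    if not os.path.exists(batch_path):    # otherwise retry a batch left by a failed run
        try:
            os.rename(spool_path, batch_path)  # new results go to a fresh spool file
        except FileNotFoundError:
            return 0                            # nothing queued since the last run
    with open(batch_path, "rb") as batch:
        data = batch.read()
    # check=True raises on a non-zero exit, leaving the batch file for the next run.
    subprocess.run(send_cmd, input=data, check=True)
    os.remove(batch_path)
    return data.count(b"\n")
```

From cron, flush_spool would be called with something like ["send_nsca", "-H", "central.example", "-c", "/etc/nagios/send_nsca.cfg"], where the host name and config path are placeholders.  A result written by a process that opened the spool file just before the rename could still be missed in this sketch, which is the kind of gap the additional integrity steps would need to close.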

Let me know if you want a copy of the two scripts I used to achieve
this.

Steve
