Problem with high latencies after going distributed

Marc Powell marc at ena.com
Wed Jan 23 18:31:57 CET 2008



> -----Original Message-----
> From: nagios-users-bounces at lists.sourceforge.net [mailto:nagios-users-
> bounces at lists.sourceforge.net] On Behalf Of Frost, Mark {PBG}
> Sent: Tuesday, January 22, 2008 10:34 AM
> To: Nagios Users
> Subject: [Nagios-users] Problem with high latencies after going
> distributed
> 
> 
> 
> As I'd mentioned in a previous message, I'm in the process of converting
> from a centralized Nagios 2.10 setup all running on a single host to a
> distributed setup running on at least 3 hosts (3 to start anyway).  The
> centralized setup has 572 hosts and 2900 services 99.9% of which are
> active checks.
> 

I'm not quite at that level here, but it's probably comparable. My
'largest' remote Nagios submits ~1200 service checks every 5 minutes to
two central boxes (two for redundancy), each of which receives a total
of 3790 passive checks every 5 minutes.

> 	Distributed Node 1                    (min/max/avg)
> 	Active Service Latency:               0.000 / 7267.198 / 4241.019 sec
> 	Active Service Execution Time:        0.000 / 60.014 / 0.651 sec
> 
> 	Distributed Node 2                    (min/max/avg)
> 	Active Service Latency:               0.000 / 11475.901 / 6393.641 sec
> 	Active Service Execution Time:        0.000 / 60.018 / 0.593 sec
> 
> Wow.

How many services are being polled/sent on each collector? My comparable
stats for the collector above are --

Active Service Latency:               0.001 / 10.390 / 2.385 sec
Active Service Execution Time:        0.089 / 47.674 / 1.274 sec

This isn't even a dedicated Nagios box. It's also doing Cricket data
collection for 12831 RRD files at 5 minute intervals and other stuff. My
opinion is that unless there is some magic threshold that I haven't
crossed (I don't expect that there is), your numbers indicate some
network or configuration problem.

Others have indicated that the OCSP execution may be an issue. Your OCSP
command should execute _very_ quickly so I don't see how it's a
significant factor at your levels unless there's a problem there,
especially when spreading out 2900 checks over 15 minutes. That's about
3 checks per second versus my 4 per second. For me to send results to
_2_ central boxes takes an insignificant amount of time --

$ time ./submit_check_result test test OK test
1 data packet(s) sent to host successfully.
1 data packet(s) sent to host successfully.

real    0m0.010s
user    0m0.000s
sys     0m0.010s
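
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python (chosen only for illustration, nothing Nagios-specific). It uses
only figures from this thread: 2900 checks over 15 minutes, ~1200 checks
every 5 minutes, and the 0.010s wall-clock time measured above. The
function names are made up, and treating submissions as running strictly
one after another is an assumption made purely to get a worst-case
estimate.

```python
def checks_per_second(num_checks, interval_minutes):
    """Average check rate if checks are spread evenly over the interval."""
    return num_checks / (interval_minutes * 60.0)


def busy_fraction(rate_per_second, seconds_per_submit):
    """Share of each second spent submitting results.

    Assumption (worst-case estimate only): submissions run strictly
    one at a time, so their durations simply add up.
    """
    return rate_per_second * seconds_per_submit


SUBMIT_SECONDS = 0.010  # 'real' time from the test run above

distributed_rate = checks_per_second(2900, 15)  # setup from the original post
collector_rate = checks_per_second(1200, 5)     # the collector described above

print(f"2900 checks / 15 min: {distributed_rate:.2f} checks/sec")
print(f"1200 checks /  5 min: {collector_rate:.2f} checks/sec")
print(f"time spent submitting at 2900/15: "
      f"{busy_fraction(distributed_rate, SUBMIT_SECONDS):.1%}")
```

Under those assumptions the submissions would occupy only a few percent
of each second, which is nowhere near enough to explain latencies
measured in thousands of seconds.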

Even taking into account Nagios setting up the call to
submit_check_result it's still trivial. Just making you aware that this
is testable by you and may be a red herring.
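
If a single run of `time` feels too noisy, one way to test it is to time
the command repeatedly and look at min/max/avg, the same layout Nagios
uses for its latency stats. This is a minimal sketch, not part of
Nagios: `time_command` and `summarize` are hypothetical helper names,
and it measures only the wall-clock cost of launching the command from
outside Nagios.

```python
import statistics
import subprocess
import time


def time_command(argv, runs=20):
    """Run argv `runs` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (min, max, mean) of the measured durations."""
    return min(durations), max(durations), statistics.mean(durations)
```

For example, calling
`summarize(time_command(["./submit_check_result", "test", "test", "OK", "test"]))`
from the directory holding the script repeats the test shown earlier 20
times. Adjust the path and arguments to match your own OCSP command.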

--
Marc

