Problem with high latencies after going distributed

Thomas Guyot-Sionnest dermoth at aei.ca
Fri Jan 25 04:03:53 CET 2008



On 24/01/08 05:13 PM, Frost, Mark {PBG} wrote:
>> -----Original Message-----
>> From: Thomas Guyot-Sionnest [mailto:dermoth at aei.ca]
>>>> I don't know how many people use OCP_Daemon but I had reports
>>>> from a few people that greatly reduced their latency using it and I
>>>> haven't had any bug reported yet. I believe it's well documented as
>>>> well, but if you have any feedback on this I'll be happy to get it.
>>>
>>> I'm playing with it a bit and have so far had good results. I'll have
>>> some feedback after I've played with it a bit longer. Thanks for
>>> writing it and writing up the docs for it as well!
>>
>> Pass the thanks over to Ethan who sent me a Nagios NSA t-shirt
>> for it ;)
>>
>> Thomas
> 
> I can see that using the OCP Daemon script cut down on my latencies
> quite a lot. Unfortunately, I'm still seeing some "stale" checks on
> the master server that I can't explain. I'm starting to get the
> feeling that going distributed isn't all it's cracked up to be. I
> haven't seen any mention in the docs of the caveats with oc[sh]p and
> latencies (my books sure don't mention it), nor of the fact that the
> supplied submit_service_check script in the distribution from Ethan
> is a shell script that pipes to send_nsca. I'm not all that excited
> about having to do a workaround for this issue.
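For context, the submit_service_check-style OCSP handler referred to
above is essentially a wrapper of this shape: it formats one result in
send_nsca's tab-separated input format and pipes it over. The install
paths and central server name below are assumptions, not taken from
the thread:

```shell
#!/bin/sh
# Hedged sketch of a submit_service_check-style OCSP handler.
# Arguments come from the ocsp_command macro expansion:
#   $1 host, $2 service description, $3 return code, $4 plugin output
HOST="$1"; SERVICE="$2"; STATE="$3"; OUTPUT="$4"

# send_nsca expects: <host>\t<service>\t<return code>\t<output>\n
printf '%s\t%s\t%s\t%s\n' "$HOST" "$SERVICE" "$STATE" "$OUTPUT" \
    | /usr/local/nagios/bin/send_nsca -H central-server \
        -c /usr/local/nagios/etc/send_nsca.cfg
```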

Although I haven't been able to reproduce it on a test setup recently,
some time ago, when we had a few passive checks running from cron on
all servers, we had problems with checks going stale. All the cron
jobs running at the same time flooded Nagios with data and caused this
behavior. I didn't find the root cause at the time, but I found that
artificially filling up the pipe while the check results were coming
in clearly exposed the problem, even though every NSCA process was
successfully writing to the pipe as Nagios was reading it. I also had
far fewer passive checks than the default 1024 slots in the Nagios
command buffer, so that couldn't have been the issue. I reported it on
nagios-devel at the time but never got any follow-up.
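One simple mitigation for the simultaneous-cron flood described above
is to stagger the submissions. This is not from the thread, just a
minimal sketch; the delay derivation and the commented paths are
assumptions:

```shell
#!/bin/sh
# Hedged sketch: spread cron-driven passive check submissions over a
# minute so every host doesn't hit the NSCA daemon (and the Nagios
# command pipe behind it) in the same second.

# Derive a stable 0-59 second offset from the hostname.
offset=$(( $(hostname | cksum | cut -d' ' -f1) % 60 ))
sleep "$offset"

# Then run the check and submit the result as usual, e.g.:
# run_check | /usr/local/nagios/bin/send_nsca -H central-server \
#     -c /usr/local/nagios/etc/send_nsca.cfg
```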

I'm not sure whether it only affects specific architectures or whether
it has been fixed since then, but that could be the problem you're
having. OTOH, the tests I ran with OCP_Daemon were pretty hardcore:
feeding it hundreds of megabytes of random check results as fast as
possible, in different modes of operation, and checking that all check
results came back from the NSCA daemon. So I'd be surprised if it had
anything to do with that.
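The stress test described above can be sketched roughly like this:
generate a large batch of synthetic passive results in send_nsca's
tab-separated format and verify none are lost. In the real test the
stream was fed through OCP_Daemon/send_nsca; this sketch (host and
service names made up) just generates and counts:

```shell
#!/bin/sh
# Hedged sketch of a lost-result stress test: emit N synthetic
# passive check results, then verify every one is accounted for.

N=10000
i=0
while [ "$i" -lt "$N" ]; do
    printf 'host%d\tsvc%d\t%d\tOK - synthetic result %d\n' \
        "$i" "$i" $(( i % 4 )) "$i"
    i=$(( i + 1 ))
done > /tmp/synthetic_results.txt

# On the receiving end, the count should match what was generated.
[ "$(wc -l < /tmp/synthetic_results.txt)" -eq "$N" ] \
    && echo "all $N results present"
```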

What you could try is:

1. Shutting off OC[HS]P on all but one server.
2. Running with -r0 to try to avoid batching results to send_nsca.

Then check whether either has any impact. Also make sure you have
command_check_interval=-1 in nagios.cfg on the central server.
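The nagios.cfg fragment below shows the setting mentioned above, plus
the standard prerequisite for accepting passive results at all. Only
command_check_interval=-1 comes from the thread; the rest is hedged
context:

```
# nagios.cfg on the central server

# Required for passive results to be processed at all.
check_external_commands=1

# Check the external command file as often as possible
# (the advice given above).
command_check_interval=-1
```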

BTW which version of Nagios and NSCA are you using?

Thomas

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null
