Problem with high latencies after going distributed

Sean McAvoy smcavoy at ca.afilias.info
Fri Jan 25 00:26:18 CET 2008


Hi Mark,
I have been having similar problems with my distributed setup. The
OCSP daemon greatly reduced the latency in returning check results,
but I am still seeing (seemingly) random services go stale. I'm still
trying to track down the problem; recreating it on a small scale has
so far been unsuccessful. I will let you (and the list) know how my
investigations go.
As for determining the execution time of a particular check, it can
be found in retention.dat; the field is check_execution_time=.
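For example, a rough way to pair each service with its recorded
execution time and list the slowest ones first (the retention.dat path
below is the common default install location; adjust it for yours):

```shell
#!/bin/sh
# List the slowest checks recorded in retention.dat.  The path is the
# usual default; the guard keeps this sketch runnable on machines
# without a Nagios install.
RETENTION=${RETENTION:-/usr/local/nagios/var/retention.dat}
[ -r "$RETENTION" ] || RETENTION=/dev/null

# Each service block carries host_name=, service_description= and
# check_execution_time= lines; pair them up and sort by time.
awk -F= '
    /host_name=/            { host = $2 }
    /service_description=/  { desc = $2 }
    /check_execution_time=/ { printf "%8.3f  %s / %s\n", $2, host, desc }
' "$RETENTION" | sort -rn | head -20
```

(Host checks carry a check_execution_time too, so expect a few lines
with a stale service name; it's good enough for spotting the worst
offenders.)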


On 24-Jan-08, at 5:13 PM, Frost, Mark {PBG} wrote:

>
>
>> -----Original Message-----
>> From: Thomas Guyot-Sionnest [mailto:dermoth at aei.ca]
>> Sent: Thursday, January 24, 2008 3:33 AM
>> To: Frost, Mark {PBG}
>> Cc: Nagios Users
>> Subject: Re: [Nagios-users] Problem with high latencies after
>> going distributed
>>
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> Some heavily broken indenting there (it looks like my mail client
>> got confused)... don't trust the number of ">"!
>>
>> On 23/01/08 10:47 PM, Frost, Mark {PBG} wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Thomas Guyot-Sionnest [mailto:dermoth at aei.ca]
>>>> Sent: Wednesday, January 23, 2008 10:24 PM
>>>> To: Frost, Mark {PBG}
>>>> Cc: Nagios Users
>>>> Subject: Re: [Nagios-users] Problem with high latencies after
>>>> going distributed
>>> I don't think so. I remember an email from Ton Voon some time ago
>>> asking Ethan why the oc[hs]p commands are run serially, but I
>>> don't recall if there was a reply or what else was said...
>>>
>>> I believe it's documented, either in the official doc or some
>>> user-contributed doc, that the oc[hs]p commands should return as
>>> soon as possible. It's usually done in Perl using a fork:
>>>
>>> if (fork() == 0) {
>>>     # child: send the result via send_nsca here...
>>>     exit(0);  # child exits once the submission is done
>>> }
>>> exit(0);      # parent returns to Nagios immediately
>>>
>>>
>>>> I guess what I'm thinking here is that unlike a custom check, I
>>>> can't see most people needing to customize the passive check
>>>> result process.  All the solutions I've seen seem to include a
>>>> named pipe.  So why couldn't Nagios support making the ocsp/ochp
>>>> "commands" just named pipes instead?  Then instead of a standalone
>>>> send_nsca binary, have the nsca source build a send_nscaD binary
>>>> (I'm making that up) that reads from the pipe that nagios writes
>>>> to and sends directly to nsca on the server.  That sort of
>>>> eliminates the middle-man in the process of reporting passive
>>>> check results.
>>>>
>>>> I know, I know, I'm free to write the send_nscaD.c code and send
>>>> it to Ethan :-)
>>
>> Well... I was thinking about partly re-writing nsca as an
>> event-based daemon (supporting only the --single mode, but really
>> scalable) using libevent, allowing it to pass along the timestamp
>> (this is a recent feature request) and supporting multi-line
>> responses (for Nagios 3) in the process, and finally suggesting
>> this as a base for an NSCA v3... I'm not even sure I would have
>> enough time, but since my main objective is to learn I wouldn't
>> lose anything by trying :).
>>
>> In the unlikely event that I write it, in the same step I could
>> surely do a C version of OCP_Daemon natively supporting the
>> "NSCA v3" protocol (it wouldn't be hard)...
>>
>> I'll have to think about it... I guess the only sane separator for
>> writing multiple multi-line results on a pipe would be \000 (NUL),
>> so there would be three modes of operation for send_nsca (and two
>> for nsca_sendd (don't you think it sounds better reversed?)):
>> send_nsca: compatible (v2 behavior), single-check (additional
>> lines are taken as additional output) and multi-check
>> (NUL-separated);
>> nsca_sendd: single-line (one check per line, OCP_Daemon style) and
>> multi-line (NUL-separated).
>>
>>> I don't know how many people use OCP_Daemon but I had reports
>>> from a few people who greatly reduced their latency using it, and
>>> I haven't had any bugs reported yet. I believe it's well
>>> documented as well, but if you have any feedback on this I'll be
>>> happy to get it.
>>>
>>>> I'm playing with it a bit and have so far had good results.
>>>> I'll have some feedback after I've played with it a bit longer.
>>>> Thanks for writing it and writing up the docs for it as well!
>>
>> Pass the thanks over to Ethan who sent me a Nagios NSA t-shirt
>> for it ;)
>>
>> Thomas
>
> I can see that using the OCP_Daemon script cut down my latencies
> quite a lot.  Unfortunately, I'm still seeing some "stale" checks on
> the master server that I can't explain.  I'm starting to get the
> feeling that going distributed isn't all it's cracked up to be.  I
> haven't seen any mention in the docs of the caveats around oc[sh]p
> and latencies (my books sure don't mention them), and even the
> submit_service_check script supplied in Ethan's distribution is just
> a shell script that pipes to send_nsca.  I'm not all that excited
> about having to do a workaround for this issue.
>
> While the OCP_Daemon seems to help me, I'm a little uncomfortable
> running it as a solution to our issue.  First, we don't normally
> have root access on our boxes, so recreating the FIFOs could be a
> problem (or at least a wait).  I'm also concerned about requiring
> another process external to Nagios as part of the pipeline.  If
> OCP_Daemon dies at some point, my distributed nodes are hosed.  I
> had a few issues with starting Nagios and OCP_Daemon in the right
> order when playing with it last night.  Once I got it all going, it
> worked well, but I'm dreading having to explain this to someone
> here who isn't the Nagios person.
>
> I was thinking of your fork/exec comment above.  What if one were
> to rewrite the "glue" shell script (the one that takes the output
> from Nagios and pipes it to send_nsca) to do something similar, but
> in C?  Additionally, have the parent fork and exit (causing Nagios
> to think the oc[sh]p command completed very quickly), then have the
> child go on and send the output to send_nsca separately.  For my
> setup, this has the advantage of not being a separate process that
> I need to make sure keeps running.  It also doesn't require
> synchronizing listeners on both ends of a pipe, where one process
> would otherwise hang.  It would be even better, it seems to me, if
> this program implemented the send_nsca functionality itself (again,
> in the child) instead of having to call send_nsca at all.  The
> biggest drawback I can see there is that you can't edit a C program
> to change the destination server, etc.  You'd just about have to
> pile on a ton of command-line options or have a config file for it.
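> Something like this, prototyped in shell (a rough sketch only: the
> server name and paths below are made up, and I'm assuming
> send_nsca's usual tab-separated host/service/return-code/output
> input format):

```shell
#!/bin/sh
# Sketch of the fork-and-exit "glue": hand the slow send_nsca step to
# a backgrounded child so the ocsp/ochp command returns to Nagios at
# once.  The SEND_NSCA path, server name and config path are
# illustrative, not real defaults.
submit_async() {
    # $1=host  $2=service  $3=return code  $4=plugin output
    (
        printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" |
            "${SEND_NSCA:-/usr/local/nagios/bin/send_nsca}" \
                -H central-nagios.example.com \
                -c /usr/local/nagios/etc/send_nsca.cfg
    ) </dev/null >/dev/null 2>&1 &
}
```

> A C version would do the same thing with fork(): the parent exits
> immediately so Nagios sees a fast command, and the child either
> exec()s send_nsca or speaks the NSCA protocol itself.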
>
> Just thinking out loud.
>
> On a related note, I see that according to my performance stats,
> some checks are still taking a very long time to run.  Is there
> some easy way I can see check execution time per check and track
> down which checks are taking such a long time?
>
> Thanks
>
> Mark
>


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




