Problem with high latencies after going distributed

Frost, Mark {PBG} mark.frost1 at pepsi.com
Thu Jan 24 23:13:13 CET 2008


 

>-----Original Message-----
>From: Thomas Guyot-Sionnest [mailto:dermoth at aei.ca] 
>Sent: Thursday, January 24, 2008 3:33 AM
>To: Frost, Mark {PBG}
>Cc: Nagios Users
>Subject: Re: [Nagios-users] Problem with high latencies after 
>going distributed
>
>-----BEGIN PGP SIGNED MESSAGE-----
>Hash: SHA1
>
>Some heavily broken indenting there (looks like my mail client gets
>confused)... don't trust the number of ">"!
>
>On 23/01/08 10:47 PM, Frost, Mark {PBG} wrote:
>>  
>> 
>>> -----Original Message-----
>>> From: Thomas Guyot-Sionnest [mailto:dermoth at aei.ca] 
>>> Sent: Wednesday, January 23, 2008 10:24 PM
>>> To: Frost, Mark {PBG}
>>> Cc: Nagios Users
>>> Subject: Re: [Nagios-users] Problem with high latencies after 
>>> going distributed
>> I don't think so. I remember an email from Ton Voon some time 
>> ago asking
>> Ethan why the oc[hs]p commands are run serially but I don't recall if
>> there was a reply or what else was said...
>> 
>> I believe it's either documented in the official doc or some
>> user-contributed doc that the oc[hs]p commands should return as soon
>> as possible. It's usually done in Perl using a fork:
>> 
>> if (fork() == 0) {
>>     # child: send stuff via NSCA here...
>> }
>> exit(0);  # parent exits right away, so Nagios doesn't wait
>> 
>> 
>>> I guess what I'm thinking here is that unlike a custom check, I can't
>>> see most people needing to customize the passive check result
>>> process.  All the solutions I've seen seem to include a named pipe.
>>> So why couldn't Nagios support making the ocsp/ochp "commands" just
>>> named pipes instead?  Then instead of a standalone send_nsca binary,
>>> have the nsca source build a send_nscaD binary (I'm making that up)
>>> that reads from the pipe that Nagios writes to and sends directly to
>>> nsca on the server.  That sort of eliminates the middle-man in the
>>> process of reporting passive check results.
>>>
>>> I know, I know, I'm free to write the send_nscaD.c code and send it
>>> to Ethan :-)
>
>Well... I was thinking about partly re-writing nsca as an event-based
>daemon (supporting only the --single mode, but that would be really
>scalable) using libevent, allowing it to pass along the timestamp (this
>is a recent feature request) and supporting multi-line responses (for
>Nagios 3) in the process, and finally suggesting this as a base for an
>NSCA v3... I'm not even sure if I would have enough time, but since my
>main objective is to learn I wouldn't lose anything by trying :).
>
>In the unlikely event that I write it, in the same step I could surely
>do a C version of OCP_Daemon natively supporting the "NSCA v3" protocol
>(it wouldn't be hard)...
>
>I'll have to think about it... I guess the only sane separator for
>writing multiple multi-line results on a pipe would be \000 (NULL), so
>there would be three modes of operation for send_nsca (and two for
>nsca_sendd (don't you think it sounds better reversed?)):
>send_nsca: compatible (v2 behavior), single check (additional lines are
>taken as additional output) and multi-check (NULL-separated)
>nsca_sendd: single-line (one check per line, OCP_Daemon style) and
>multi-line (NULL-separated).
>
>> I don't know how many people use OCP_Daemon but I had reports from a
>> few people who greatly reduced their latency using it, and I haven't
>> had any bug reported yet. I believe it's well documented as well, but
>> if you have any feedback on this I'll be happy to get it.
>> 
>>> I'm playing with it a bit and have so far had good results.  I'll
>>> have some feedback after I've played with it a bit longer.  Thanks
>>> for writing it and writing up the docs for it as well!
>
>Pass the thanks over to Ethan who sent me a Nagios NSA t-shirt 
>for it ;)
>
>Thomas

I can see that using the OCP_Daemon script cut down on my latencies
quite a lot.  Unfortunately, I'm still seeing some "stale" checks on the
master server that I can't explain.  I'm starting to get the feeling
that going distributed isn't all it's cracked up to be.  I haven't seen
any mention in the docs of the caveats with oc[sh]p and latencies (my
books sure don't mention it), and even the supplied submit_service_check
script in the distribution from Ethan is just a shell script that pipes
to send_nsca.  I'm not all that excited about having to do a workaround
for this issue.

While OCP_Daemon seems to help, I'm a little uncomfortable running it
as the solution to our issue.  First, we don't normally have root access
on our boxes, so recreating the FIFOs could be a problem (or at least a
wait).  I'm also wary of requiring another process external to Nagios in
the reporting chain.  If OCP_Daemon dies at some point, my distributed
nodes are hosed.  I had a few issues getting Nagios and OCP_Daemon
started in the right order when playing with it last night.  Once I got
it all going it worked well, but I'm already thinking about having to
explain all of this to someone here who isn't the Nagios person.

I was thinking about your fork/exec comment above.  What if one were to
rewrite the "glue" shell script (the one that takes the output from
Nagios and pipes it to send_nsca) to do something similar, but write it
in C?  Have the parent fork and exit right away (so Nagios thinks the
oc[sh]p command completed very quickly), then have the child go on and
send the output to send_nsca separately.  For my setup, this has the
advantage of not being a separate long-running process that I have to
make sure keeps running.  It also doesn't require synchronizing
listeners on both ends of a pipe, where one process would otherwise
hang.  It would be even better, it seems to me, if this program could
do the send_nsca work itself (again, in the child) instead of having to
call send_nsca at all.  The biggest drawback I can see there is that
you can't just edit a C program to change the destination server, etc.
You'd just about have to pile on a ton of command-line options or give
it a config file.
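
Roughly what I'm picturing (just a sketch, completely untested; the
send_nsca path, the "central-server" name and the argument order are my
guesses and would need to match however the ocsp_command is defined):

    /* submit_check_async.c: sketch of the fork-and-exit glue idea above.
     * The send_nsca path, central server name and argument order are
     * assumptions; adjust them to match your own ocsp_command. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        pid_t pid;
        FILE *fp;

        if (argc != 5) {
            fprintf(stderr, "usage: %s host service return_code output\n",
                    argv[0]);
            return 1;
        }

        pid = fork();
        if (pid < 0)
            return 1;   /* fork failed */
        if (pid > 0)
            return 0;   /* parent: exit at once so Nagios sees a fast oc[sh]p */

        /* Child: detach so Nagios isn't left waiting on our output pipe. */
        setsid();
        freopen("/dev/null", "r", stdin);
        freopen("/dev/null", "w", stdout);
        freopen("/dev/null", "w", stderr);

        /* Hand the result to send_nsca (assumed paths and server name). */
        fp = popen("/usr/local/nagios/bin/send_nsca -H central-server "
                   "-c /usr/local/nagios/etc/send_nsca.cfg", "w");
        if (fp == NULL)
            _exit(1);

        /* send_nsca's default input: tab-separated fields, newline-ended. */
        fprintf(fp, "%s\t%s\t%s\t%s\n", argv[1], argv[2], argv[3], argv[4]);
        pclose(fp);
        return 0;
    }

Nagios would run it as the ocsp_command with the host, service, state
and plugin output as arguments, i.e. the same sort of thing the glue
shell script feeds to send_nsca today.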

Just thinking out loud.

On a related note, my performance stats show that some checks are still
taking a very long time to run.  Is there an easy way to see execution
time per check so I can track down which checks are taking so long?

Thanks

Mark
