Problem with high latencies after going distributed

Thomas Guyot-Sionnest dermoth at aei.ca
Wed Jan 23 04:29:03 CET 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 22/01/08 09:13 PM, Frost, Mark {PBG} wrote:
>  
> 
>> -----Original Message-----
>> From: Steve Shipway [mailto:s.shipway at auckland.ac.nz] 
>> Sent: Tuesday, January 22, 2008 8:45 PM
>> To: Frost, Mark {PBG}; Nagios Users
>> Subject: RE: [Nagios-users] Problem with high latencies after 
>> going distributed
>>
>> We've just done exactly the same (Nagios 2.9), and we have a comparable
>> size of system (actually a bit larger - 713 hosts, 5834 services).
>> After going distributed, we too have this insanely high latency on the
>> satellites.
>>
>> The only possible cause is the OCSP command slowing things down somehow.
>> This is using the supplied send_nsca call to send the status off to the
>> central server...
>>
>> define command {
>>    command_name    relay
>>    command_line    $USER1$/submit_check_result "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATEID$" "$SERVICEOUTPUT$"
>> }
>>
>> So it should work.  I guess things would be better if it packaged the
>> updates up into batches, although it cant do that normally.
>>
>> I think it might be better to make the OCSP command just dump the status
>> to a file, and then have a cronjob every 60 seconds that reads the file
>> and sends the statuses off as a batch.  I will try this here, when I get
>> the chance.
>>
>> Steve
> 
> 
> But if the submit_check_result is running slowly, that would only affect
> the service execution time wouldn't it?  My understanding of check latency
> is that it's the difference in time between when Nagios schedules a check
> to run versus the time that the check actually starts to execute.

You're right, but you're missing one detail. Nagios runs checks in
parallel and then reaps all the service results at once. While it's
reaping it can't schedule other checks, and it is in the reaping state
that Nagios runs host checks, event handlers, performance data commands
and oc[hs]p commands. All of this is done serially, so it can
significantly slow down each service reaping run and thus delay the
execution of further checks.
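
To put rough numbers on this, here is a back-of-the-envelope sketch in
Python. It is a simplified model, not how Nagios actually schedules
anything: it only compares how much serial command time arrives per
second with the one second available to process it. The 5834 services
come from Steve's message; the 5-minute check interval and the 0.1 s per
OCSP command are assumptions picked for illustration.

```python
def serial_load(services, check_interval_s, command_time_s):
    """Seconds of serial per-result command work generated per wall-clock second.

    Simplified model: every service is checked once per interval and each
    result triggers one command that runs serially during reaping.
    """
    results_per_second = services / check_interval_s
    return results_per_second * command_time_s


# 5834 services is from the thread; interval and command time are assumed.
load = serial_load(services=5834, check_interval_s=300, command_time_s=0.1)
print(f"serial load: {load:.2f}")
if load > 1.0:
    print("the serial work cannot keep up, so check latency keeps growing")
```

Under these assumptions the load comes out well above 1.0, meaning the
reaper would need more than one second of serial work for every second
of results, which is consistent with latency climbing on the satellites.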

Although I have never built a distributed system, I designed mine to be
easily distributed. Moreover, I used a technique I developed for
latency-free performance-data processing (which I still use heavily, BTW)
to create a way to send check results to a central server in a
distributed setup in the same latency-free way (it was more of a fun
project, as I don't use it myself yet).

Basically you use the host/service performance data files to get the
data, but instead of writing to a regular file you write to a named pipe
(fifo). That pipe is then read by a high-performance, non-blocking,
event-based Perl daemon (yeah, I know that looks like marketing terms,
but I can explain each of them further if you like) that forks send_nsca
processes to send results in bulk (normally every few seconds).
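
For readers who want to see the shape of the idea, here is a minimal
Python sketch. It is not the Perl daemon from the wiki page, just an
illustration of the same pattern: read the fifo without blocking, buffer
complete lines, and hand each batch to one send_nsca process. It assumes
the performance data template already writes tab-delimited lines in the
format send_nsca expects (host, service, return code, output). The names
split_lines, send_batch and relay, the host central.example.org, the
config path and the 5-second flush interval are all hypothetical.

```python
import os
import select
import subprocess
import time


def split_lines(buffer):
    """Split raw bytes into complete text lines plus the unfinished remainder."""
    *complete, rest = buffer.split(b"\n")
    return [c.decode(errors="replace") + "\n" for c in complete if c], rest


def send_batch(lines, central_host="central.example.org",
               nsca_cfg="/etc/nagios/send_nsca.cfg"):
    """Pipe one batch into a single send_nsca run (host and path are assumed)."""
    subprocess.run(["send_nsca", "-H", central_host, "-c", nsca_cfg],
                   input="".join(lines).encode(), check=False)


def relay(fifo_path, send=send_batch, flush_every=5.0, max_loops=None):
    """Read results from the fifo and forward them in batches."""
    rfd = os.open(fifo_path, os.O_RDONLY | os.O_NONBLOCK)
    # Holding our own write end open avoids a stream of EOFs between writers.
    keepalive = os.open(fifo_path, os.O_WRONLY)
    pending, rest = [], b""
    deadline = time.monotonic() + flush_every
    try:
        while max_loops is None or max_loops > 0:
            timeout = max(0.0, deadline - time.monotonic())
            readable, _, _ = select.select([rfd], [], [], timeout)
            if readable:
                lines, rest = split_lines(rest + os.read(rfd, 65536))
                pending.extend(lines)
            if time.monotonic() >= deadline:
                if pending:
                    send(pending)
                    pending = []
                deadline = time.monotonic() + flush_every
            if max_loops is not None:
                max_loops -= 1
    finally:
        os.close(rfd)
        os.close(keepalive)
```

Calling relay("/path/to/your.fifo") would run until interrupted. The
point of the design is that Nagios only pays for a write to the pipe,
while the cost of starting send_nsca is paid once per batch by a
separate process.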

So Nagios doesn't even lose time rotating a file, and all your check
results are transmitted almost instantly. See this wiki page for details
and code:

http://www.nagioscommunity.org/wiki/index.php/OCP_Daemon


Thomas
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFHlrR+6dZ+Kt5BchYRAgPAAKD7Rj6esSEe+yU4oiw6f+zI5SwTQgCeLJRS
Kc+BjLetcWxzanZOREHO8ks=
=2pY+
-----END PGP SIGNATURE-----
