How to reduce a very high latency number

Greg Cope greg.cope at e-dba.net
Wed May 24 10:46:48 CEST 2006


Jacob,

I noticed the same thing today.

We run a few distributed servers that each do about 150 checks (at the
moment) and submit the results to our central server.

That's a lot of send_nsca processes that get spawned.

I like your fix!

send_nsca does not appear to scale for those running lots of
passive checks on distributed systems.

Greg

On Tue, 2006-05-23 at 09:48 -0400, Jacob Ritorto wrote:
> Greetings,
>        A colleague of mine (poctum) and I ran into something like
> this while using nsca and have crafted a similar solution.  We
> observed that send_nsca was sending only one result to the central
> Nagios server per connection.  Testing revealed that send_nsca was
> capable of handling thousands of results per connection.  Sending only
> one at a time was resulting in lots of dropped data because there were
> nominally about 5 results derived per second.  We enabled
> aggregate_status_updates in the nagios.cfg file, but this yielded no
> improvement in the result submissions.  BTW, this is Nagios-2.2 and
> nsca-2.6 on Solaris 10.  Our workaround is a quick and dirty but
> efficient solution.  It may not be as refined as trask's and relies on
> nuances of unix file handling algorithms to get the job done.  That
> said, it's working perfectly for us.  As this seems to work well, but
> may violate Ethan's design intentions, your feedback/input is
> requested.  Deploy at your own risk.
> 
> Jacob Ritorto, Lead UNIX Server Operations Engineer
> InnovationsTech
> 
> Here's our solution:
> 
> 1) Altered last line in
> /opt/nagios/libexec/eventhandlers/submit_check_result thusly.  It
> basically concatenates check results to a temp file.
> 
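> # Old: one send_nsca process (and one connection) per result.
> #/bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" |
> #    /opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg
> 
> # New: append the result to a spool file for the reap daemon.
> /bin/printf "%s\t%s\t%s\t%s\n" "$1" "$2" "$return_code" "$4" \
>     >> /opt/nagios/var/results.waiting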
> 
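Step 1 above can be sketched as a small POSIX shell function. This is only an illustration: the function name queue_result and the SPOOL variable are not part of the original script, and the default path simply mirrors the one quoted above.

```shell
#!/bin/sh
# Hypothetical helper: append one passive check result to a spool file
# instead of spawning send_nsca for every result.
# SPOOL is an illustrative variable; its default mirrors the quoted path.
SPOOL="${SPOOL:-/opt/nagios/var/results.waiting}"

# Usage: queue_result HOST SERVICE RETURN_CODE PLUGIN_OUTPUT
queue_result() {
    # One tab-separated line per result, the same layout the original
    # printf piped to send_nsca.
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" >> "$SPOOL"
}
```

A call such as `queue_result web01 HTTP 0 "OK"` adds one line to the spool file; nothing is sent until the file is shipped by a separate process.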
> 
> 2) Created a daemon process called reap (managed by smf, but it has
> been up for a month so far, so may be ok as an init.d script) to pull
> aside the aforementioned temp file (results.waiting) every five
> seconds and send the bits off to the central Nagios server (note that
> original file is re-created immediately via step 1 above).  This
> probably only works perfectly on unix & unix-like systems due to the
> nature of files hanging around intact until the last program
> referencing them has exited.  It's been some time, but the last I
> checked, DOS/WINxxxx doesn't treat files this way.  Here's the simple
> little reap daemon:
> 
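> # cat /opt/nagios/bin/reap
> #!/usr/bin/tcsh
> # Every 5 seconds, move the spool file aside and send it in one connection.
> while (1)
>  sleep 5
>  mv /opt/nagios/var/results.waiting /opt/nagios/var/results.sending
>  # Pipe the whole batch to a single send_nsca process.
>  cat /opt/nagios/var/results.sending | /opt/nagios/bin/send_nsca \
>    172.16.x.x -c /opt/nagios/etc/send_nsca.cfg >/dev/null
> end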
> 
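For those who prefer plain sh over tcsh, one reap cycle might look roughly like the sketch below. It is not the script quoted above: the names reap_once, SPOOL, BATCH and SEND_CMD are made up for illustration, the defaults copy the quoted paths, and the check for an empty spool is an addition so that a quiet 5-second window does not resend the previous batch.

```shell
#!/bin/sh
# Sketch of one reap cycle. SPOOL, BATCH and SEND_CMD are illustrative
# names; the defaults mirror the paths and send_nsca call quoted above.
SPOOL="${SPOOL:-/opt/nagios/var/results.waiting}"
BATCH="${BATCH:-/opt/nagios/var/results.sending}"
SEND_CMD="${SEND_CMD:-/opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg}"

reap_once() {
    # Nothing queued since the last cycle: do not resend the old batch.
    [ -s "$SPOOL" ] || return 0
    # A rename within one filesystem is atomic, so writers that open the
    # spool afterwards simply create a fresh file.
    mv "$SPOOL" "$BATCH" || return 1
    # Send the whole batch over a single connection.
    # SEND_CMD is unquoted on purpose so it splits into command + arguments.
    $SEND_CMD < "$BATCH" > /dev/null
}

# Daemon loop, left commented out so the file can be sourced safely:
# while :; do sleep 5; reap_once; done
```

Setting SEND_CMD to something harmless (for example `tee /tmp/sent`) is a simple way to try the cycle without touching the central server.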
> 
> Summary:  Slave Nagios servers now store up check results in the temp
> file for 5 seconds, then they get shipped off to nsca on the central
> Nagios machine in one swoop instead of one-at-a-time.
> 
> 
> *~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~*~
> 
> 
> 
> From: Trask <trasko at gm...>
> Re: How to reduce a very high latency number
> 2006-05-23 03:50
> 
> On 5/22/06, srunschke at abit.de <srunschke at abit.de> wrote:
> > nagios-users-admin at lists.sourceforge.net schrieb am 17.05.2006 20:09:16:
> >
> > To me this is obviously a performance issue related to hardware.
> > Your machines have far too little RAM. It is simply not possible to
> > run 1800 checks on a 512MB machine in a timely manner.
> >
> 
> I figured this out this past Saturday.  It is not a hardware
> shortage.  I was seeing neither significant load nor excessive
> memory use.  No configuration change I made seemed to have any
> appreciable effect on the latency times I was getting.  I ended up
> running "top" at 1-second intervals and just watching it for a while.
> I noticed that sometimes there would be a good number of nagios
> processes (20, 30, 40 or so), but the majority of the time there were
> only 2, 3 or 4.  Although I do not know exactly *why* this was
> happening, it turns out that while only 2-4 processes were running,
> 2 of them were always the "submit_passive_check" script and
> "send_nsca".  It appears that this is being done serially (i.e. not
> in parallel) and ends up blocking subsequent checks until they are
> done.  I would see these 2 processes running (with steadily
> increasing PIDs) for up to a minute and then a short-lived (4-5
> seconds) "explosion" of nagios processes (service/host checks).
> After this flurry of activity, it would be another 60 seconds or so
> of just 2-4 processes.
> 
> I resolved this problem by changing my "submit_passive_check" script.
> Below are some sample scripts, both old and new.  The short of it is
> like this:  Previously, the "submit_passive_check" script did a printf
> of the data in the appropriate format and piped it to the "send_nsca"
> command (in a shell script).  I have eliminated this bottleneck by
> having "submit_passive_check" redirect its output to a named pipe and
> then having another script feed "send_nsca" with that data as it comes
> in to the named pipe.
> 
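The scripts trask refers to are not reproduced in this message, so the following is only a guess at the general shape of a named-pipe arrangement, not his implementation. The FIFO path, the function names submit_to_fifo and feed_once, and the SEND_CMD variable are all hypothetical; the send_nsca invocation is borrowed from Jacob's message.

```shell
#!/bin/sh
# Sketch of a named-pipe feeder. FIFO and SEND_CMD are illustrative;
# /var/tmp/nsca.fifo is a hypothetical location.
FIFO="${FIFO:-/var/tmp/nsca.fifo}"
SEND_CMD="${SEND_CMD:-/opt/nagios/bin/send_nsca 172.16.x.x -c /opt/nagios/etc/send_nsca.cfg}"

# Writer side: what the submit script would do instead of piping each
# result straight into its own send_nsca process.
submit_to_fifo() {
    printf '%s\t%s\t%s\t%s\n' "$1" "$2" "$3" "$4" > "$FIFO"
}

# Reader side: forward everything written to the FIFO until all current
# writers have closed it, using a single send_nsca process.
feed_once() {
    # SEND_CMD is unquoted on purpose so it splits into command + arguments.
    $SEND_CMD < "$FIFO" > /dev/null
}

# Long-running feeder, left commented out so the file can be sourced:
# [ -p "$FIFO" ] || mkfifo "$FIFO"
# while :; do feed_once; done
```

One thing to keep in mind with this design: opening a FIFO for writing blocks until a reader has it open, so the submit script will stall whenever the feeder is not running.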
> Latency times have dropped from 600-700 seconds to 0.2 seconds on
> the worst server and from 45-55 seconds to 0.06 seconds on the
> second-worst.
> That's more like it!
> 
> Below are a few scripts w/ notes as to what each one is.  Thanks to
> everyone who offered help.
> 
> ~trask
> 
> 
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null







