Instrumenting Nagios

Steven D. Morrey smorrey at ldschurch.org
Thu May 21 15:48:55 CEST 2009


gprof doesn't like Nagios.
It generates a new set of profile data for each fork.
I have 30,000 service checks on 3,000 hosts that run each hour.
Even then it's ok for 30 minutes or an hour, but when you are trying to debug something that takes 2 or 3 days to show, it becomes nearly impossible to manage.
oprofile buggered the entire system on my development boxes (SLES 9 on VMware).
Hence the need to instrument just the important parts, unless you folks know of some switch or another I can pass in at compile time to keep the profile data manageable.
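
A minimal sketch of what instrumenting just the important parts could look like, assuming each event handler can be wrapped at its call site in the event loop. The event names, event_run_timed() and event_stats_dump() are hypothetical and are not existing Nagios functions. The totals accumulate in the main process's memory, so forked check processes produce no output of their own.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical event categories; illustrative names, not Nagios identifiers. */
enum timed_event {
    EV_SERVICE_REAPER,
    EV_COMMAND_CHECK,
    EV_RETENTION_SAVE,
    EV_STATUS_SAVE,
    EV_NOTIFICATION,
    EV_COUNT
};

struct event_stats {
    unsigned long calls;   /* how many times the handler ran */
    double total_sec;      /* accumulated wall-clock time */
    double max_sec;        /* slowest single run */
};

static struct event_stats stats[EV_COUNT];

/* Monotonic wall-clock time in seconds (unaffected by clock adjustments). */
double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Record one handler run that took `elapsed` seconds. */
void event_record(enum timed_event ev, double elapsed)
{
    assert(ev >= 0 && ev < EV_COUNT);
    struct event_stats *s = &stats[ev];
    s->calls++;
    s->total_sec += elapsed;
    if (elapsed > s->max_sec)
        s->max_sec = elapsed;
}

/* Run a handler and charge its duration to the given event category. */
void event_run_timed(enum timed_event ev, void (*handler)(void))
{
    double start = now_sec();
    handler();
    event_record(ev, now_sec() - start);
}

/* Print one summary line per event category. */
void event_stats_dump(FILE *out)
{
    static const char *const names[EV_COUNT] = {
        "service_reaper", "command_check", "retention_save",
        "status_save", "notification"
    };
    int i;
    for (i = 0; i < EV_COUNT; i++)
        fprintf(out, "%-16s calls=%lu total=%.3fs max=%.3fs\n",
                names[i], stats[i].calls,
                stats[i].total_sec, stats[i].max_sec);
}
```

Dumping the table periodically, for example alongside the status save, would show which event types account for the most wall-clock time over a multi-day run. On older glibc versions, clock_gettime() may require linking with -lrt.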

Thanks!

Sincerely,
Steve

________________________________________
From: eponymous alias [eponymousalias at yahoo.com]
Sent: Wednesday, May 20, 2009 7:50 PM
To: Nagios Developers List
Subject: Re: [Nagios-devel] Instrumenting Nagios

To the extent that such delays may be partly
due to the general cost of computing, profiling the
entire nagios binary would not be a bad idea.
gprof is your friend.

--- On Tue, 5/19/09, Steven D. Morrey <smorrey at ldschurch.org> wrote:

> From: Steven D. Morrey <smorrey at ldschurch.org>
> Subject: [Nagios-devel] Instrumenting Nagios
> To: "nagios-devel at lists.sourceforge.net" <nagios-devel at lists.sourceforge.net>
> Date: Tuesday, May 19, 2009, 11:11 AM
> Hi Everyone,
>
> We're trying to track down a high latency issue we're
> having with our Nagios system and I'm hoping to get some
> advice from folks.
> Here's what's going on.
>
> We have a system running Nagios 2.12 and DNX 0.19 (latest).
> This setup consists of 1 main Nagios server and 3 DNX
> "worker nodes".
>
> We have 29000+ service checks across about 2500 hosts. Over
> the last year we average about 250 or more services alarming
> at any given time. We also have on average about 10 hosts
> down at any given time.
>
> My original thought was that perhaps DNX was slowing down,
> maybe a leak or something, so I instrumented DNX by timing
> from the moment it's handed a job until it posts the results
> into the circular results buffer.
> This figure holds steady at 3.5s.
>
> I am pretty sure all checks are getting executed (at least,
> all the ones that are enabled) eventually. Just more and
> more slowly over time.
> Clearly, some checks are being delayed by something or even
> many things.  What I'd like to do is to perhaps extend
> nagiostats to gather information about why latency is at the
> level it is, to see if we can't determine why Nagios is
> waiting to run these checks.
>
> What should we be looking at, either in the event loop or
> outside of it, to get a good overview of what Nagios is
> doing, how, and why?
>
> We are thinking of adding counters to the different events
> (both high and low priority) in an attempt to determine the
> source of the latency in detail. For example, if the average
> check latency is 100 seconds, being able to show that 30
> seconds of that was spent doing notifications, 20 seconds
> doing service reaping, etc. That way we can know where we
> need to make optimizations.
>
> I'm thinking that if we can instrument the following events
> we should have most of our bases covered (note some of these
> may already be instrumented)...
>
> log file rotations,
> external command checks,
> service reaper events,
> program shutdown,
> program restart,
> orphan check,
> retention save,
> status save,
> service result freshness,
> host result freshness,
> expired downtime check,
> check rescheduling,
> expired comment check,
> host check,
> service check
>
> Is there anything else that could or should be instrumented
> to give us a good view into what Nagios is doing that's
> causing service checks to be executed further and further
> away from when they were scheduled?
>
> Is this list complete? Do these make sense to instrument and
> would they be useful in determining what is contributing to
> check latency?
>
>
> Thanks in advance!
>
> Sincerely,
> Steve
>
>
>  NOTICE: This email message is for the sole use of the
> intended recipient(s) and may contain confidential and
> privileged information. Any unauthorized review, use,
> disclosure or distribution is prohibited. If you are not the
> intended recipient, please contact the sender by reply email
> and destroy all copies of the original message.
>
>
>
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>





