Nagios Profiler Changes

Andreas Ericsson exon at op5.com
Tue Jun 16 20:11:43 CEST 2009


Steven D. Morrey wrote:
> Hi Everyone,
> 
> As you know I've been hard at work creating a profiler for nagios
> that is simple, flexible, extensible, fast and above all accurate.
> 
> My initial design was to create a collection of global timer (gt_*)
> and global odometer (go_*) variables that could then be written out to
> status.dat one by one. This worked OK but quickly became unwieldy, for
> obvious reasons.
> 
> My next design was a linked list of objects containing the timer and
> the counter, as well as the name or event type. This made extensibility
> a snap, but would have had a significant impact on speed, since we
> would have to walk the list at best, and at worst do a strcmp() on
> every single object every time we wanted to update a stat. So this
> idea was discarded for the time being.
> 
> Finally I had a better idea.  Each event type is an integer and even
> though they aren't necessarily close together they would still be
> appropriate for an array index even if it's a sparse one. So this is
> the new profiler design.
> 
> We have an object containing an elapsed time, a counter, and an
> enabled flag.
> 
> We have an array of these objects indexed by event type:
> profiler[event].counter++;
> 

Right. That sounds sensible, as there's quite a limited number of event
types to profile.
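
For illustration, a minimal sketch of the array-indexed design described
above. The event type values, the PROFILER_MAX_EVENTS bound and the
function names are assumptions made for this example, not taken from the
actual patch:

```c
#include <assert.h>

/* Hypothetical event type values; the real Nagios constants may differ. */
enum {
    EVENT_SERVICE_CHECK = 5,
    EVENT_HOST_CHECK    = 12,
    PROFILER_MAX_EVENTS = 100  /* assumed upper bound on event type values */
};

struct profile_entry {
    double        elapsed;  /* accumulated wall-clock time, in seconds */
    unsigned long counter;  /* number of times the event was recorded */
    int           enabled;  /* non-zero if this event type is profiled */
};

/* Sparse table: unused slots stay zeroed, which also means disabled. */
struct profile_entry profiler[PROFILER_MAX_EVENTS];

void profiler_enable(int event)
{
    assert(event >= 0 && event < PROFILER_MAX_EVENTS);
    profiler[event].enabled = 1;
}

/* Constant-time update: no list walk and no strcmp(). */
void profiler_update(int event, double seconds)
{
    assert(event >= 0 && event < PROFILER_MAX_EVENTS);
    if (!profiler[event].enabled)
        return;
    profiler[event].counter++;
    profiler[event].elapsed += seconds;
}
```

The cost of the sparse array is a few unused slots, which is cheap as
long as the largest event type value stays small.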

> Then when we write it out to status.dat we have a very simple loop
> that looks to see if the event type is enabled for profiling and
> outputs it if it is. The output looks like:
> PROFILE_COUNTER_EVENT_SERVICE_CHECK=100
> 

I like it.
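
A rough sketch of what such an output loop could look like. The name
field, the profiler_write() function and the exact format of the ELAPSED
line are assumptions for illustration; only the PROFILE_COUNTER_ line
format comes from the description above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

struct profile_entry {
    const char   *name;     /* assumed label, e.g. "EVENT_SERVICE_CHECK" */
    double        elapsed;  /* accumulated time, in seconds */
    unsigned long counter;  /* number of recorded events */
    int           enabled;  /* non-zero if this event type is profiled */
};

/*
 * Writes one COUNTER and one ELAPSED line for every enabled entry.
 * Returns the number of entries written.
 */
int profiler_write(FILE *fp, const struct profile_entry *table, size_t n)
{
    size_t i;
    int written = 0;

    assert(fp != NULL && table != NULL);
    for (i = 0; i < n; i++) {
        const struct profile_entry *e = &table[i];

        if (!e->enabled || e->name == NULL)
            continue;  /* unused or disabled slot in the sparse array */
        fprintf(fp, "PROFILE_COUNTER_%s=%lu\n", e->name, e->counter);
        fprintf(fp, "PROFILE_ELAPSED_%s=%.3f\n", e->name, e->elapsed);
        written++;
    }
    return written;
}
```

Calling profiler_write() with the status.dat file handle and the whole
table would then emit only the event types that were enabled.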

> Nagiostats then looks for the word PROFILE, and then for COUNTER or
> ELAPSED, then adds the entry to a linked list, a la my second design,
> and outputs it via MRTG or the normal nagiostats output.
> 
> The other major difference is what we are using to measure time. In
> the original design we just used time(), but later we decided we
> needed more resolution, so we went to clock(). Finally it was
> discovered that using clock() would introduce a bug every 72 minutes

Umm, that's not entirely true. It's not a bug, but a wrap-around. It
*would* be a bug if we were trying to capture events longer than 72
minutes, but I doubt that's the case. The accumulated value would also
have to be divided by some suitable power of 10 in order to actually
store a sum larger than 72 minutes, but that's true for gettimeofday()
as well.
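
To make the wrap-around argument concrete, here is a small sketch. It
assumes the clock() reading is treated as an unsigned 32-bit tick count
at 1,000,000 ticks per second; the function names are made up for this
example:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Ticks between two readings of a free-running 32-bit counter.
 * Unsigned subtraction is modulo 2^32, so a single wrap between the
 * two readings cancels out. Only intervals longer than the wrap
 * period give a wrong answer.
 */
uint32_t ticks_elapsed(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Seconds until a 32-bit tick counter wraps at the given tick rate. */
double wrap_period_seconds(uint32_t ticks_per_sec)
{
    assert(ticks_per_sec != 0);
    return 4294967296.0 / (double)ticks_per_sec;  /* 2^32 / rate */
}
```

With 1,000,000 ticks per second the wrap period is about 4295 seconds,
which is where the figure of roughly 72 minutes comes from.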

> and so now we just use gettimeofday(). In the next version I may
> include clock() time as well, but I thought that this would be
> sufficient for our needs. Let me know what you think and I'll try to
> get a patch out ASAP.
> 

I think it would be sensible to store time as seconds with three decimal
places of precision. Note that with gettimeofday() we're measuring
wall-clock time as opposed to CPU time. That's probably sensible, since
that's what Nagios uses to present latency too.
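
One possible way to get there, sketched under the assumption that the
total is kept as integer milliseconds and only formatted as seconds on
output. elapsed_ms() and format_seconds() are hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/time.h>

/* Milliseconds of wall-clock time between two gettimeofday() samples. */
long long elapsed_ms(const struct timeval *start, const struct timeval *end)
{
    long long us;

    assert(start != NULL && end != NULL);
    us = (long long)(end->tv_sec - start->tv_sec) * 1000000LL
       + (long long)(end->tv_usec - start->tv_usec);
    return us / 1000;
}

/*
 * Formats a non-negative millisecond total as seconds with three
 * decimal places, e.g. 1250 becomes "1.250". Returns what snprintf()
 * returns.
 */
int format_seconds(char *buf, size_t len, long long ms)
{
    assert(buf != NULL && ms >= 0);
    return snprintf(buf, len, "%lld.%03lld", ms / 1000, ms % 1000);
}

/*
 * Typical use around a profiled section:
 *
 *   struct timeval t0, t1;
 *   gettimeofday(&t0, NULL);
 *   ... the work being profiled ...
 *   gettimeofday(&t1, NULL);
 *   total_ms += elapsed_ms(&t0, &t1);
 */
```

Keeping the running total as an integer avoids accumulating
floating-point rounding error over many small samples.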

/Andreas
