[PATCH] circular dependency check speedup

Andreas Ericsson ae at op5.se
Mon Dec 20 23:58:13 CET 2010


On 12/20/2010 09:02 PM, Glenn Herteg wrote:
>   >>  I've done some organoleptic analysis of the patch (looked at it).
> 
> The trouble with organoleptic analysis is that it's axiomatic in the
> benchmarking business that your guess (or mine) as to where time is
> being spent is simply wrong.  The software+hardware computational
> model is much too complex for people to have accurate intuition
> about how all the elements interact.  Actual measurements are needed.
> 

Agreed. But experienced people are rarely surprised when the bottleneck
happens to be in a particular place. Thanks for the writeup, btw. I'll
respond in more depth below.

>   >  I'll respond in more detail in a few days.
> 
> GroundWork developed the circular dependency patch by profiling
> the code while running preflight operations on a large customer
> configuration.  A pre-flight operation lists the following counts:
> 
>       Checked 35471 services.
>       Checked 3362 hosts.
>       Checked 116 host groups.
>       Checked 0 service groups.
>       Checked 290 contacts.
>       Checked 225 contact groups.
>       Checked 0 service escalations.
>       Checked 51031 service dependencies.
>       Checked 0 host escalations.
>       Checked 0 host dependencies.
>       Checked 475 commands.
>       Checked 256 time periods.
> 
> Total size of the set of Nagios config files is under 18 MB.
> 
> To skip immediately to the final results, here are the timings from
> test runs while successive stages of this patch were being developed:
> 
>       baseline:  pre-flight:  859.113 seconds
>       stage 1:   pre-flight:  185.139 seconds
>       stage 2:   pre-flight:   42.856 seconds
>       stage 3:   pre-flight:    3.107 seconds
> 
> The baseline run was taken using a stock Nagios 3.2.0 release.
> The remaining runs were taken using patched versions of Nagios 3.2.1,
> which does not differ from 3.2.0 in its calculations for circular
> dependency analysis.
> 
> The timings shown above are wall-clock measurements for end-to-end
> pre-flight operations.  We also took detailed call-graph profile
> timings (using gprof) during this development effort to guide
> the code changes, but those reports were not retained afterward,
> so I don't have corresponding function-level timings available
> now to document each stage of patch development.  The claim of a
> speedup of around 20,000x for the circular dependency analysis is
> based on improvements seen in those function-level timings during
> the development.
> 
> ========================================================
> 
> Stage 1:
> 
> Initial profiling showed that most of the time was being spent
> in pre_flight_circular_check().  There's no i/o in that routine
> or the ones it calls, so right away we know the problem we face is
> computational, not due to any system file-caching effects.  Also,
> there's not much code in that routine that could be the source of the
> problem, so the nature of the bad performance was fairly obvious:
> an O(n^2) clearing of the circular_path_checked values within
> independent data structures in a linked list.  To address that,
> we made the following changes for the first stage of patching:
> 
> * make a contiguous array for
>     servicedependency_circular_path_checked, effectively removing the
>     circular_path_checked flags from the individual data structures
>     and also changing the boolean flags from 4 bytes down to 1 byte
> * use a call to memset() to clear the entire array in one operation,
>     rather than walking a linked list
> 
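
If I read the patch right, the stage 1 change boils down to something
like this -- a rough sketch with invented names, not the actual diff:

  #include <stdlib.h>
  #include <string.h>

  /* One packed unsigned char flag per dependency, kept in a single
   * array, instead of an int flag buried inside every list node.
   * Names are invented for the sketch. */
  static unsigned char *sd_path_checked;
  static int sd_flag_count;

  static int init_sd_flags(int count)
  {
      sd_flag_count = count;
      sd_path_checked = malloc((size_t)count);
      return sd_path_checked ? 0 : -1;
  }

  /* clear every flag in one sweep before each top-level circular
   * check, instead of walking the whole linked list to reset each
   * node's circular_path_checked member */
  static void clear_sd_flags(void)
  {
      memset(sd_path_checked, 0, (size_t)sd_flag_count);
  }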

Why not calloc() it from the beginning? Since the array is created
when Nagios starts, one can often avoid even the memset(): memory
freshly obtained from the kernel is already guaranteed to be zeroed.
This also avoids libc going through its fortification routines and
setting the allocated memory to 0xb5b5b5b5 (or whatever magic
pattern it uses to mark allocated-but-not-written-to memory).
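
That is, something like:

  /* Keeping the names from the sketch above: calloc() hands back
   * memory that is already zeroed, so the initial clearing pass
   * comes for free. */
  static int init_sd_flags_zeroed(int count)
  {
      sd_flag_count = count;
      sd_path_checked = calloc((size_t)count, sizeof(*sd_path_checked));
      return sd_path_checked ? 0 : -1;
  }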

> Extra time is now taken to prepare the new contiguous array, but
> the timing results clearly show the benefit far outweighs the cost.
> 
> The revised code is still O(n^2) (as we still have to clear just
> as many flags, and just as often), but one dimension of this square
> has been radically shrunk.  Modern processors are highly optimized
> for clearing a continuous swath of memory to a fixed value, since
> clearing large areas to zero is such a commonly needed action.
> So we use that mechanism instead of jumping randomly around in
> memory.  (Aside from benefits of using probably-microcoded processor
> capabilities for the contiguous clearing, such as highly-efficient
> multimedia instructions, we also avoid a tremendous amount of
> potential cache-line aliasing and attendant inefficiency in memory
> accesses, and possibly extra page faults.)  With contiguous boolean
> flags, we were also able to change their type from int to unsigned
> char, reducing the space and thus further reducing the amount of
> time needed for the clearing action.
> 
> These changes resulted in a 4.64x speedup in overall pre-flight time
> from the baseline (down from 859.113 seconds to 185.139 seconds).
> 

Neat. And so far so good.

> ========================================================
> 
> Stage 2:
> 
> Profiling now showed check_for_circular_servicedependency_path()
> as the bottleneck.  So we made the following changes for the second
> stage of patching:
> 
> * make a contiguous array for servicedependency_list, so it can be
>     walked more efficiently, similar to above
> * factor some repeated pointer dereferencing out of loops

Still so far so good.
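
For the record, I take the dereference factoring to be plain hoisting
of loop-invariant lookups out of the inner loop, roughly like this
(simplified stand-in types, not the real structs):

  /* Illustration only -- simplified stand-in, not the real struct. */
  struct sd_entry {
      const void *master;
      const void *dependent;
  };

  /* Count the entries whose master matches root's dependent service.
   * The point is that root->dependent is loop-invariant, so it is
   * read once up front instead of being re-dereferenced as
   * entries[i].master == root->dependent on every iteration. */
  static int count_masters(const struct sd_entry *root,
                           const struct sd_entry *entries, int n)
  {
      const void *dep = root->dependent;   /* hoisted out of the loop */
      int i, hits = 0;

      for (i = 0; i < n; i++)
          if (entries[i].master == dep)
              hits++;

      return hits;
  }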

> * re-order servicedependency_struct members, based on timing tests
>     and examination of the compiled assembly code

This is one of the timing tests I'm most interested in. How much did
this gain you, and on what architecture(s) was this tested? What is
the difference in timing if this change is reverted?

Since this is the only real showstopper, I'll need to back it up with
proper numbers in order to justify the change, so jumbling it in with
a bunch of other changes and saying "these 6 changes caused this much
speedup" doesn't do me any good. Access to your testconfig would neatly
solve the problem of me badgering you for these numbers. Rip out any
sensitive information first and then just put the config up for grabs
so I can test it myself.
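
Just so we're talking about the same thing, I read "re-order
servicedependency_struct members" as packing the members the circular
check touches next to each other, ahead of the cold ones -- something
along these lines, with invented member names:

  /* Not the real servicedependency struct -- members invented purely
   * to show the idea: the fields the circular check reads sit
   * together at the front, ahead of the cold ones. */
  struct sd_layout_example {
      /* hot: touched on every iteration of the circular check */
      void *dependent_svc;
      void *master_svc;
      int   dependency_type;

      /* cold: only needed when checks and notifications fire */
      char *dependency_period;
      int   inherits_parent;
      int   fail_flags;
      struct sd_layout_example *next;
  };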

> * simplify some boolean tests (e.g., there is no need to test for
>     (found==TRUE) [a specific non-zero value] when (found!=FALSE) or
>     simply (found) compares against zero [a highly optimized operation]
>     and gives the same useful result; nor any need to store a
>     transient boolean value in memory when it can be tested directly
>     as soon as it is produced)
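
In source terms that's just the choice between (assuming the usual
TRUE == 1, FALSE == 0):

  #define TRUE  1
  #define FALSE 0

  static void example(int found)
  {
      if (found == TRUE)  { /* compares against the specific value 1 */ }
      if (found != FALSE) { /* compares against zero */ }
      if (found)          { /* same as the line above, idiomatic */ }
  }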

This is the difference between

  cmp reg1,reg2
  je someplace
and
  or reg1,reg1
  jz someplace

The real gain is actually in not having to load the second register,
and it's often lost in the noise unless one takes extra pains to
ensure proper pipeline saturation. I like idiomatic programming though,
so I have no problems with this.

> * drop some redundant NULL-pointer tests
> 
> Once again in this stage, extra time is now taken to prepare the new
> contiguous array, but the timing results clearly show the benefit
> far outweighs the cost.
> 
> This stage of patching reduced the critical bottleneck in the
> processing to a single loop, compiled to 6 assembly instructions, one
> of which was a no-op filler instruction, with the loop having only
> one external memory reference per iteration.  Basically, you can't
> get any tighter than that and do any useful work.  The remaining
> performance problem was that this loop was being executed for
> 1.3 billion iterations when analyzing the customer dataset, which
> (even at only about 27.6 nanoseconds per iteration) still adds up
> to macroscopic time.  (Remember what an O(n^2) algorithm is all
> about, at the scale of 51031 service dependencies, half of which
> are execution dependencies and half of which are notification
> dependencies.  (51031/2)^2 + (51031/2)^2 = 1.3 billion.)
> 

Wouldn't a more sensible approach have been to make sure this loop
isn't executed more than a maximum of 51031 times then?

Since each dependent service depends on at least one other
service, and each service can be marked as "not circularly
dependent" once it has been checked successfully, any service
that in turn depends on an already checked and non-circular
service is by definition non-circularly dependent. Then you
could let this loop take 1 millisecond and it would still
complete in a tiny fraction of the time the original algorithm
took, even without micro-optimizing the inner loop.
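
To put some flesh on that: what I have in mind is the classic
three-state depth-first marking.  A rough sketch with invented names,
not patch-ready code:

  #include <stddef.h>

  enum { UNSEEN = 0, ON_PATH = 1, PROVEN_OK = 2 };

  struct dep {               /* simplified stand-in for servicedependency */
      int master;            /* index of the service this one depends on */
      struct dep *next;
  };

  /* deps_of[svc] lists the services svc depends on; state[] holds one
   * byte per service and is cleared once.  Every service is fully
   * walked at most once, so total work is linear in the number of
   * dependencies instead of quadratic. */
  static int dep_has_cycle(int svc, struct dep **deps_of,
                           unsigned char *state)
  {
      struct dep *d;

      if (state[svc] == PROVEN_OK)
          return 0;          /* memoized: already known to be acyclic */
      if (state[svc] == ON_PATH)
          return 1;          /* back on the current path: circular */

      state[svc] = ON_PATH;
      for (d = deps_of[svc]; d != NULL; d = d->next)
          if (dep_has_cycle(d->master, deps_of, state))
              return 1;
      state[svc] = PROVEN_OK;  /* never walked again */

      return 0;
  }

Recursion depth is bounded by the longest dependency chain, which for
any real-world config is nowhere near 51031 deep.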

> These changes resulted in a 20.0x speedup in overall pre-flight
> time from baseline (down from 859.113 seconds to 42.856 seconds),
> and a 4.32x speedup from the previous stage (down from 185.139
> seconds to 42.856 seconds).
> 
> ========================================================
> 
> Stage 3:
> 
> check_for_circular_servicedependency_path() remained as the
> bottleneck, so we analyzed why we had so many iterations of its
> internal looping.  That turned out to be because a lot of data in
> essentially random order (with respect to what we care about in
> this analysis) was being searched linearly, and again by jumping
> around in memory while walking the full length of a linked list --
> another clear opportunity for structure+algorithm optimization.
> So we made the following changes for the third stage of patching:
> 
> * sort the servicedependency_list contiguous array
> * use a specialty binary search to find the first dependent service
>     we care about in the sorted list, if present
> * change the search for dependent services to examine only a short
>     sequence of actual candidates, starting at that first dependent
>     service
> * re-order more servicedependency_struct members, again based on
>     a timing study
> * perform some trivial code cleanup from the previous patch stages
> 
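
So, presumably, something along these lines (invented names; the real
patch sorts its servicedependency array directly):

  #include <stdlib.h>

  /* Simplified stand-in; 'dependent_id' is an invented key meaning
   * "which service this row is about". */
  struct sd_row {
      int dependent_id;
      int master_id;
  };

  /* order by dependent service only, so all rows for one dependent
   * service end up in a single contiguous run of the array */
  static int cmp_dependent(const void *a, const void *b)
  {
      const struct sd_row *x = a, *y = b;

      return (x->dependent_id > y->dependent_id) -
             (x->dependent_id < y->dependent_id);
  }

  static void sort_sd_rows(struct sd_row *rows, size_t n)
  {
      qsort(rows, n, sizeof(*rows), cmp_dependent);
  }
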
> One might wonder whether the additional O(n*log(n)) time now taken
> to sort might be a significant overhead to bear, but the timing
> results clearly show the benefit far outweighs the cost.
> 

Except on Solaris and IRIX, where the worst-case sorting scenario
(sorting an already sorted list) actually costs O(n^2). But I
digress.

> Why do we use a specialty search instead of some standard library
> function?  With respect to the usual suspects:
> 
> * lfind(3) would be slow; its linear search of an array is exactly
>     what we're trying to avoid here.
> * bsearch(3) [binary search of a sorted array] and tfind(3) [find
>     a matching element in a binary tree] both return an unspecified
>     element if there are multiple elements that match the key.
> 
> What we need instead is to always find the first element in a
> contiguous series of matching elements, if any such elements exist,
> so that we can walk just the full length of that segment of the
> sorted array and then be done.  So we need to implement our own
> routine to perform the search.
> 

Agreed, but why implement the same algorithm twice? Write it once
and reuse it. Nagios is already riddled with duplicated code that
is almost-but-not-exactly identical for hosts and services when
they often should be exactly the same. There's no need to add more
such insanities.
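
Something with a bsearch(3)-style interface would cover both cases.
A rough, untested sketch:

  #include <stddef.h>

  /* Like bsearch(), but when several elements compare equal to the
   * key, return the FIRST element of that run (or NULL if nothing
   * matches), so the caller can walk forward while compar() keeps
   * returning 0.  Invented name; sketch only. */
  static void *bsearch_first(const void *key, const void *base,
                             size_t nmemb, size_t size,
                             int (*compar)(const void *, const void *))
  {
      const char *b = base;
      size_t lo = 0, hi = nmemb;

      while (lo < hi) {
          size_t mid = lo + (hi - lo) / 2;

          if (compar(key, b + mid * size) > 0)
              lo = mid + 1;    /* first match can only be to the right */
          else
              hi = mid;        /* equal or greater: keep looking left */
      }

      if (lo < nmemb && compar(key, b + lo * size) == 0)
          return (void *)(b + lo * size);

      return NULL;
  }

Sort the array once with qsort(), then step forward from the returned
element for as long as compar() still returns 0; host and service
dependencies would only differ in the comparator they pass in.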

> This stage of patching avoids walking the full list of dependencies
> when only a few (now in contiguous sequence, and easy to quickly
> identify) need to be analyzed.  Profiling showed that the dependency
> analysis, which was taking many hundreds of seconds in the original
> code, had now been reduced to only a few hundredths of a second;
> hence the claimed 20,000x speedup.
> 
> These changes resulted in a 276.5x speedup in overall pre-flight
> time from baseline (down from 859.113 seconds to 3.107 seconds),
> and a 13.79x speedup from the previous stage (down from 42.856
> seconds to 3.107 seconds).
> 

Impressive indeed.

> ========================================================
> 
> The discussion above reflects improvements in circular service
> dependency analysis.  Exact corresponding changes were also made to
> improve circular host dependency analysis.  That accounts for the
> nearly-identical code in the patch, as different data structures
> for hosts and services are being handled.  In this regard, we
> are simply mirroring the structure of the code we are patching,
> following the requirements of C-language type handling.
> 
> ========================================================
> 
> An additional part of testing involved purposely introducing a number
> of circular dependencies of various types into the test dataset,
> to establish that the revised code still found exactly the same
> errors as the original baseline code.
> 

This should alleviate Ton's worries, although I know he would be much
more at ease with the patch if he got to see the test-configs you
produced and tried this with, so he could run the tests himself and
possibly turn them into automated test-cases.

> ========================================================
> 
> What's the practical impact of this patch?  Well, it can prevent
> lengthy monitoring outages when Nagios is bounced.  For example,
> I would be completely unsurprised if the effect that Rodney Ramos
> reported a couple of weeks ago ("NSCA dropping packet with stale
> timestamp") is due to the existing dependency-analysis performance,
> if his configuration involves a lot of dependencies.  While Nagios
> is spending time analyzing dependencies, it won't be reading its
> command pipe, and pipe writers such as NSCA will quickly fill the
> pipe and be unable to write more.  That could stall NSCA's reading
> of its own client connections to the point that, when it finally
> does read them, the packets it finds are already stale.  Look back in
> the archives and you'll see that nobody responded to his request
> for help.  This patch may be the solution for his situation.
> 

But your patch revolves only around config reading and not the runtime
processing of dependencies when they need to be looked up for checking
and notification purposes, so how can this be the case?

> ========================================================
> 
> So what's the upshot of these changes, with regard to the previously
> published critique?
> 
> * The several stages of patch development make it clear that
>     algorithm changes and structure changes can go hand-in-hand to
>     achieve significant performance improvements.
> 

That's no surprise. The other way around would have been surprising
indeed.

> * All of the structure changes were made based on observed timing
>     improvements at the stage when they were implemented, not on
>     guesswork about what the compiler might or might not be doing.
> 

I still have no numbers or test-data to back up the claim that the
structure reorganization provides such a significant speedup that
it's actually worth applying that part of the patch.

> * When we re-order certain structure members, there is a potential
>     tradeoff:  some member accesses will become much faster, while
>     others that were presumably formerly fast will now become slower.

I doubt they will (on both counts) for the structs in question, so I'll
take some numbers with that claim, if you don't mind.

>     This tradeoff seemed reasonable on the following grounds.
> 
>     The accesses we care about for circular dependency analysis are
>     grouped together at a singular point in time, and have a very
>     visible effect then, during process startup.  Slower accesses for
>     other members will be spread out during normal operation across
>     the entire day, and should therefore be essentially unnoticeable,
>     causing imperceptible and acceptable latency by eating very
>     slightly into the processor idle time.  No, we haven't analyzed
>     the rest of the code, nor have we made timing measurements to
>     prove this hypothesis, but that's our gut feeling.  Yes, I know
>     how I started this writeup.
> 
> * All tests were run in an x86_64 environment, as these days it
>     is the sweet spot of the computing industry.  Five years ago,
>     I would not have said that, but in the last couple of years, it
>     has become difficult to purchase a serious machine which is not
>     multi-core and 64-bit capable.  That's clearly where the market is
>     today and in the future, for anyone who cares about performance
>     at scale (which is what this patch is intended to address).
>     The x86_64 architecture suffers slightly with respect to 32-bit
>     x86 because it uses more memory bandwidth to fetch wider objects,
>     but it more than makes up for this by architectural extensions
>     that make more registers available.
> 
> * An i386 (32-bit x86) environment was not tested, nor did we run
>     tests on UltraSPARC, PowerPC, Itanium, or any other architectures.
>     Naturally, we have no guarantees that improvements targeted to
>     the x86_64 won't slightly decrease performance on other machines.
>     But given the wide popularity of the x86_64 architecture, targeting
>     the most commonly used machines as the basis for optimization seems
>     more than reasonable.  Whatever small degradation might appear
>     on these other platforms should be much more than made up for by
>     the dramatic overall performance improvement afforded by the patch.
> 
> Rather than fighting this patch because the structure member changes
> will break NEB compatibility in the 3.2.X releases, and wasting
> everyone's time trying to disentangle the synergistic effect of
> both algorithm and structure changes, perhaps a different approach
> is in order.  The commentator has already expressed a strong desire
> to move to the 3.3.X release series:
> 
>   >>  Do you have any ETA about the release date of the 3.3 series ?
>   >  No. Ethan is still the only one who can push the release button.
>   >  Otherwise I'd have cut a release quite a while ago.
> 
> So perhaps adoption of this patch should be treated as the triggering
> event to move forward and declare this new minor release level.
> 

You're missing the point. The patch can't be applied as-is until we set
a merge-window for backwards-incompatible changes and a release-date
for when we want to break the ABI. So you can either rip out the struct
changes and resubmit (preferably with your bsort_first() shared between
host and service dependencies) or get me test-configs along with numbers
justifying why this particular change should cause us to pin down the
release-date for a backwards-incompatible change.

I can see that the patch provides a nice performance-boost, but it's
simply not good enough to make me endure answering questions from NEB
module authors on why their dependency mangling module doesn't work
with the latest CVS code.

Simply put: I value 5 potentially wasted minutes of my time more
highly than 10,000 years' worth of potentially saved CPU time.

I'll rip out the struct changes myself and apply the rest unless you
convince me why I shouldn't. Saying "it's faster. trust me!" won't
work. Handing me the config so I can test it myself would.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
