[PATCH] Re: alternative scheduler

Adam Augustine augustineas at gmail.com
Wed Dec 1 20:55:15 CET 2010


On Wed, Dec 1, 2010 at 8:33 AM, Jochen Bern <Jochen.Bern at linworks.de> wrote:
> On 12/01/2010 03:14 PM, Andreas Ericsson wrote:
>> A much better solution would be to spawn workers to handle the checks
>> and let the master parent just sit and receive results and update status
>> files though, but that's a quite invasive change so it'll have to wait a
>> bit.
>
> Would DNX or Mod Gearman (which both have a NEB module snatch checks
> before the Nagios core gets around to executing them itself, and feed
> them into sort of a distributed batch queue system instead) be close
> enough to qualify?
>
> http://dnx.sourceforge.net/
> http://labs.consol.de/nagios/mod-gearman/
>
> Kind regards,
>                                                                J. Bern
> --

While DNX and mod_gearman do implement that specific functionality,
they are still subject to the scheduler/reaper bottlenecks. We (the
institution that started the DNX project) have played around with the
check scheduling parameters quite a bit over the years, and even with
our best scheduling parameters and DNX actually executing the plugins,
we still see checks bunched together: a large number are scheduled to
execute in a single second, followed by several seconds (3-5) with
nothing scheduled at all. That isn't necessarily
a big problem as long as the DNX and mod_gearman workers can handle
the peaks. But you then have to provide bigger hardware than if the
checks were scheduled more smoothly. And really, all things being
equal, the average number of checks scheduled to execute in any given
second should be relatively constant.
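
To put rough numbers on the sizing argument, here is a minimal sketch
that compares a clustered schedule with an evenly spaced one. It is not
the Nagios scheduler; the function names, the 60 second window, the
6000 checks and the 5 second gap are all assumptions chosen for
illustration.

```python
from collections import Counter

def per_second_load(times, horizon):
    """Count how many checks fall into each whole second of [0, horizon)."""
    counts = Counter(int(t) for t in times)
    return [counts.get(s, 0) for s in range(horizon)]

def bursty_schedule(n_checks, horizon, gap):
    """Hypothetical clustered schedule: checks land only on every gap-th second."""
    burst_seconds = list(range(0, horizon, gap))
    return [float(burst_seconds[i % len(burst_seconds)]) for i in range(n_checks)]

def smooth_schedule(n_checks, horizon):
    """Evenly spaced schedule: one check every horizon / n_checks seconds."""
    return [i * horizon / n_checks for i in range(n_checks)]

HORIZON = 60      # length of the window in seconds (assumed)
N_CHECKS = 6000   # checks due within the window (assumed)

bursty = per_second_load(bursty_schedule(N_CHECKS, HORIZON, gap=5), HORIZON)
smooth = per_second_load(smooth_schedule(N_CHECKS, HORIZON), HORIZON)

print("mean checks/s:", N_CHECKS / HORIZON)
print("bursty peak:  ", max(bursty))
print("smooth peak:  ", max(smooth))
```

With these assumed numbers both schedules average 100 checks per
second, but the clustered one peaks at 500 in a single second, so the
workers would have to be sized for five times the mean load.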

I suppose it is possible we have something misconfigured because we
have misunderstood the inner workings of the scheduler, but I am at a
loss as to where, and we have spent a lot of time in the past looking
at how checks are scheduled and executed.

I haven't used Merlin yet (I intend to do some testing), but the model
of distributed schedulers each handling smaller numbers of checks
works around that problem. If it does work that way, though, it only
hides the issue rather than fixing it. DNX actually posts results
directly to the
circular results buffer to bypass some of the reaper issues. I noticed
Andreas' blog posting includes breaking the reaper into its own thread
to get the level of performance shown, which makes sense, and I think
an attempt to do that was posted some time ago on this list by Steve
Morrey.
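
The general shape of a reaper running in its own thread can be
sketched as below. This is only an illustration of the pattern, not
the code from Andreas' blog posting or from Nagios itself; the queue,
the status dictionary and the function names are all hypothetical
stand-ins.

```python
import queue
import threading

results = queue.Queue()   # stand-in for the check result buffer
status = {}               # stand-in for the in-memory status data
STOP = object()           # sentinel that tells the reaper to exit

def reaper():
    """Drain results as they arrive, independently of the scheduling loop."""
    while True:
        item = results.get()          # blocks until a result is available
        if item is STOP:
            break
        check_name, return_code = item
        status[check_name] = return_code

def scheduler(n_checks):
    """Pretend to run checks and hand each result to the reaper."""
    for i in range(n_checks):
        results.put((f"check_{i}", 0))   # 0 is the plugin return code for OK

reaper_thread = threading.Thread(target=reaper)
reaper_thread.start()
scheduler(1000)
results.put(STOP)         # queued after all results, so nothing is lost
reaper_thread.join()
print(len(status), "results reaped")
```

The point of the design is that the scheduling loop never stops to
process results; it only enqueues them, and the reaper wakes as soon as
something is available instead of polling on a timer.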

Anyway, the scheduler was, I think, originally designed to be very
conservative in terms of CPU use, and back in the day that made sense
with the limited hardware that was available. I think now the
expectation is that large installations will be dedicating hardware
and wanting Nagios to consume as much as has been allocated to it.
Clearly changing some of the sleeps to sched_yield() would be a good
beginning. Putting the reaper into another thread (as Andreas' blog
posting indicates) is another massive improvement.
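
The difference between a fixed sleep and a yield can be illustrated
with a small wait loop. This is a sketch under assumptions, not the
Nagios event loop: the 50 ms sleep, the 20 ms deadline and the helper
names are made up, and os.sched_yield() is only available on Unix-like
systems (the sketch falls back to a zero-length sleep elsewhere).

```python
import os
import time

# Fall back to a zero-length sleep where os.sched_yield() is unavailable.
yield_cpu = getattr(os, "sched_yield", lambda: time.sleep(0))

def fixed_sleep():
    """Hypothetical fixed pause between passes of an event loop."""
    time.sleep(0.05)

def wait_until(deadline, pause):
    """Loop until `deadline` (time.monotonic() seconds), calling `pause` each pass.

    Returns how many seconds past the deadline the loop finished.
    """
    while time.monotonic() < deadline:
        pause()
    return time.monotonic() - deadline

late_sleep = wait_until(time.monotonic() + 0.02, fixed_sleep)
late_yield = wait_until(time.monotonic() + 0.02, yield_cpu)

print(f"late after fixed sleep: {late_sleep * 1000:.1f} ms")
print(f"late after yielding:    {late_yield * 1000:.1f} ms")
```

The sleeping loop cannot react before its sleep expires, so it overshoots
the deadline by at least the remainder of the sleep. The yielding loop
usually reacts much sooner, at the cost of keeping a CPU busy, which is
the trade-off a dedicated monitoring host may be willing to make.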

But I think the scheduler still needs to be looked at, unless we are
one of a small group seeing that behavior.

Adam Augustine
