alternative scheduler

Fredrik Thulin ft at it.su.se
Wed Nov 24 11:34:01 CET 2010


On Wed, 2010-11-24 at 10:23 +0100, Jochen Bern wrote:
...
> I see. I should probably explain why I'm asking, then (everyone else,
> please excuse the wall of text):

Thanks, lots of interesting thoughts.

> Given a Nagios configuration (number of active checks, their
> check_period, check_interval, retry_interval, and max_check_attempts), a
> distribution of state changes, and (I hope) a bunch of Queueing Theory
> formulae, one can determine the average rate X/min at which checks
> *ought* to be scheduled and executed.

This is what I do, although admittedly I take _very_ many shortcuts in
my proof of concept.
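
As a rough illustration of the kind of rate calculation discussed here (not the code from the proof of concept), the sketch below takes the biggest shortcut possible: it assumes every service is always OK and always inside its check_period, so retry_interval and max_check_attempts never come into play. The function names and the example configuration are hypothetical.

```python
def steady_state_rate_per_min(check_intervals_min):
    """Checks per minute, assuming each service is checked exactly once
    per check_interval (no retries, no check_period restrictions)."""
    return sum(1.0 / interval for interval in check_intervals_min)


def tick_interval_ms(rate_per_min):
    """Spacing between check starts (ms) needed to sustain the given rate."""
    return 60_000.0 / rate_per_min


# Hypothetical configuration: 6000 services, each with check_interval = 5 min.
intervals = [5] * 6000
rate = steady_state_rate_per_min(intervals)
tick = tick_interval_ms(rate)
```

Under these assumptions the required rate translates directly into the N ms spacing a ticker would have to use.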

I have a ticker that initiates the start of a new asynchronous service
check every N ms, regardless of how long it took to start the last
service check.

The ticker is deliberately synchronous though, to achieve a primitive
sort of blow-up prevention. What I mean is that if an asynchronous
service check worker suddenly takes N+1 ms to be spawned (spawned in
Erlang is not the same as fork()ed), I won't start another one until the
next tick occurs.
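
The blow-up prevention can be illustrated with a toy simulation. This is a Python sketch with a simulated clock, not the Erlang implementation; `simulate_ticker` and its parameters are made up for the example. It assumes ticks fall on multiples of N ms and that a spawn which overruns a tick boundary causes that tick to be skipped rather than caught up later.

```python
def simulate_ticker(interval_ms, spawn_costs_ms):
    """Return the start time (ms) of each spawn under a synchronous ticker.

    spawn_costs_ms[i] is how long the i-th spawn call blocks the ticker.
    Ticks fall on multiples of interval_ms; a tick that passes while the
    ticker is still busy spawning is skipped, never replayed in a burst.
    """
    now = 0
    starts = []
    for cost in spawn_costs_ms:
        starts.append(now)
        finished = now + cost
        # First tick boundary at or after the moment the spawn returned.
        next_boundary = -(-finished // interval_ms) * interval_ms
        # Never start two spawns on the same tick.
        now = max(now + interval_ms, next_boundary)
    return starts


# With N = 10 ms, the second spawn takes N+1 ms, so the tick at 20 ms is lost.
starts = simulate_ticker(10, [1, 11, 1, 10])
```

The design choice shown here is that a slow spawn lowers the achieved rate for a moment instead of queueing up work that would later be released all at once.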

> In evaluating a new check
> scheduler, the first thing I'd be interested in would be its
> *correctness*, from the detail (single host/service) up to the global
> level (yielding a rate of X/min, not less, nor more - hence my confusion
> about your "the more checks per minute, the better!" stance).

Right. What level of correctness are you talking about here? Erlang is a
soft real time system, and absolute correctness in this regard I believe
would require a (hard) real time system.

I'm not even operating on dedicated hardware, and very close to the
limit of what my hardware can do, but these are the numbers of checks I've
run in each of the most recent five-minute periods:

  11195, 11746, 11295, 10632, 11020, 11190, 11174, 11460, 11693,
  10980, 11596, 11378, 11159
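
To put these counts in perspective, the average rate and the spread can be worked out with a few lines of Python. The variable names are only for this example, and the min-to-max spread relative to the mean is just one simple way to express fluctuation.

```python
from statistics import mean

# Checks per five-minute period, as listed above.
counts = [11195, 11746, 11295, 10632, 11020, 11190, 11174, 11460, 11693,
          10980, 11596, 11378, 11159]

period_s = 5 * 60
avg = mean(counts)                          # average checks per period
per_second = avg / period_s                 # average checks per second
spread = (max(counts) - min(counts)) / avg  # min-to-max spread relative to mean
```

This works out to an average of about 11,270 checks per period, or roughly 37.6 checks per second, with a min-to-max spread of about 10% of the mean.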

This is a virtual machine in a VMware ESX cluster that is known to have
performance issues.

Just from observations on the graph at
http://people.su.se/~ft/test/mrtg_nagios-dev-srv1/nagios-f.html
it seems obvious that I have much, much less fluctuation when operating
at a level that is more comfortably within the limits of the hardware.

> Once correctness has been established, one can go on to check whether
> it's a "good" scheduler. However, there's more than one definition of
> quality that one may use. One possibility is to measure the *maximum*
> sustainable rate of checks that can be (scheduled and) executed. ...

Agreed. For me, the immediate goal was to get checks executed in a
timely fashion. My production system currently reports these sad numbers
for service check executions:

<= 1 minute:            59 (1.0%)
<= 5 minutes:          466 (7.7%)
<= 15 minutes:        1444 (23.9%)
<= 1 hour:            6012 (99.4%)
Since program start:  6048 (100.0%)

Whether or not it understands things like check_period, retry_interval,
and max_check_attempts is really irrelevant as long as it can't even invoke
service checks fast enough to even pretend to be busy. The current five
minute load average on the machine running the service checks is 0.55.

/Fredrik






