alternative scheduler

Jochen Bern Jochen.Bern at LINworks.de
Wed Nov 24 10:23:21 CET 2010


On 11/23/2010 09:08 PM, Fredrik Thulin wrote:
> On Tue, 2010-11-23 at 20:43 +0100, Jochen Bern wrote:
>> On 11/23/2010 01:59 PM, Fredrik Thulin wrote:
>>> I was able to write a brand new scheduler that works MUCH better - 1160
>>> checks per minute, compared to ~60. Any plans to do something drastic
>>> about the Nagios service check scheduler?
>> One question, for sake of clarification: Does your definition of "check
>> scheduling" include the mid-term planning (i.e., "check returned OK,
>> should be repeated after the configured check_interval, if check_period
>> permits" and the likes), or only the short-term scheduling of "due"
>> checks onto the resources for actual execution (in the style of a
>> (distributed) batch queue)?
> The proof of concept is super simple - it was all done in less than six
> hours time.
> You load it with in my case ~6000 checks, and say that you want them
> started in N seconds (in my case 300 seconds). 
[...]
> Improving the scheduler to support different check_intervals etc. would
> not be difficult, but is something I've never utilized with Nagios to
> date.

I see. I should probably explain why I'm asking, then (everyone else,
please excuse the wall of text):

Given a Nagios configuration (number of active checks, their
check_period, check_interval, retry_interval, and max_check_attempts), a
distribution of state changes, and (I hope) a bunch of Queueing Theory
formulae, one can determine the average rate X/min at which checks
*ought* to be scheduled and executed. In evaluating a new check
scheduler, the first thing I'd be interested in would be its
*correctness*, from the detail (single host/service) up to the global
level (yielding a rate of X/min, not less, nor more - hence my confusion
about your "the more checks per minute, the better!" stance).
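
To make the "rate of X/min" concrete, here is a minimal sketch of the
simplest such estimate. It is not a Queueing Theory model, and the names
(Check, soft_fraction, expected_rate_per_min) are made up for this
illustration. It assumes a 24x7 check_period and takes the share of time
a check spends retrying as a given input instead of deriving it from
max_check_attempts and the distribution of state changes.

```python
from dataclasses import dataclass


@dataclass
class Check:
    check_interval_min: float    # minutes between checks in a steady state
    retry_interval_min: float    # minutes between rechecks while retrying
    soft_fraction: float = 0.0   # ASSUMED share of time spent retrying


def expected_rate_per_min(checks):
    """Average check executions per minute that the configuration calls for.

    Assumes check_period is 24x7; time-outs and latency are ignored.
    """
    rate = 0.0
    for c in checks:
        steady = (1.0 - c.soft_fraction) / c.check_interval_min
        retrying = c.soft_fraction / c.retry_interval_min
        rate += steady + retrying
    return rate


# The numbers from this thread: ~6000 checks, each due once per 300 seconds.
checks = [Check(check_interval_min=5, retry_interval_min=1) for _ in range(6000)]
print(round(expected_rate_per_min(checks)))  # about 1200 checks per minute
```

Under these assumptions, a correct scheduler should end up close to that
figure, neither well below nor well above it.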

Once correctness has been established, one can go on to check whether
it's a "good" scheduler. However, there's more than one definition of
quality that one may use. One possibility is to measure the *maximum*
sustainable rate of checks that can be (scheduled and) executed. Another
is how well the scheduler recovers from a handcrafted, badly distributed
initial schedule: it should smooth out the load within Y cycles, with a
max deviation of Z % from the {check,retry}_interval.
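
Both gauges can be put into numbers fairly easily. The following is only
a sketch of how one might measure them; the function names and the
choice of "busiest second vs. average second" as the load metric are my
assumptions for illustration, not something Nagios provides.

```python
from collections import Counter


def peak_to_mean_load(start_times, window_s):
    """Busiest second divided by the average per-second load in the window.

    1.0 means a perfectly even schedule; larger values mean sharper peaks.
    """
    per_second = Counter(int(t) % window_s for t in start_times)
    return max(per_second.values()) / (len(start_times) / window_s)


def max_deviation_pct(actual_gaps_s, interval_s):
    """Largest deviation (in %) of observed gaps from the configured interval."""
    return max(abs(g - interval_s) for g in actual_gaps_s) / interval_s * 100.0


even = list(range(300))   # one check starting in every second of the window
bunched = [0] * 300       # all checks starting in the same second
print(peak_to_mean_load(even, 300), peak_to_mean_load(bunched, 300))
print(max_deviation_pct([300, 293, 296], 300))
```

Y would then be the number of cycles it takes for the first figure to
drop below whatever threshold one considers "smooth", and Z the second
figure observed while getting there.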

Which brings us to the current Nagios code. In some installations,
random influences make the scheduled check times "flow together" into
peaks of workload (see the attached graph for what happens to my
scheduling every midnight when Nagios rotates the log). Nagios (3.2.x)
does *not* fix such peaks unless you do a restart with *complete*
rescheduling (I hacked a random -7..0 seconds offset into the code,
which smooths out my midnight-induced peaks over the course of ~6 hours).
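
To illustrate the effect of such an offset, here is a small simulation.
It is not the actual patch (that lives in the Nagios C code); the check
count, the 300-second interval, and applying the offset on every
reschedule are assumptions chosen for the example.

```python
import random
from collections import Counter


def simulate(n_checks=1000, interval_s=300, cycles=72, seed=1):
    """Reschedule every check `cycles` times with a random -7..0 s offset."""
    rng = random.Random(seed)
    times = [0] * n_checks  # worst case: every check due in the same second
    for _ in range(cycles):
        # randint() includes both ends, so the offset is one of -7, -6, ..., 0
        times = [t + interval_s + rng.randint(-7, 0) for t in times]
    return times


def peak_per_second(times):
    """Number of checks due in the single busiest second."""
    return max(Counter(times).values())


before = peak_per_second([0] * 1000)
after = peak_per_second(simulate())  # 72 cycles of 300 s are 6 hours
print(before, after)
```

The peak shrinks because the per-cycle offsets add up to a different
total for each check, so checks that started out in the same second
drift apart a little more with every cycle.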

Anyone who has to work with check_periods a lot has even more of a
problem. If the {check,retry}_interval would place the next check
outside the check_period, Nagios will schedule it for the *very first
second* of the upcoming in-period timeframe - and it does so for *ALL*
affected checks.
In a case reported by another colleague, that made for a fireball of
20,000 checks in the same second - which blew a redundant pair of Nagios
servers clear out of the water.
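
The following sketch models that behaviour in a few lines. It is a
simplified model of what I described above, not the Nagios code itself;
the function name, the single daily 08:00-18:00 timeframe, and the
300-second interval are assumptions made for the example.

```python
from collections import Counter


def next_check_time(last_run, interval_s, period_start, period_end,
                    next_period_start):
    """Next start time under the 'clamp to period start' rule (simplified)."""
    candidate = last_run + interval_s
    if period_start <= candidate < period_end:
        return candidate
    # Outside the check_period: everything lands on the very first second
    # of the next in-period timeframe.
    return next_period_start


DAY = 86400
period_start, period_end = 8 * 3600, 18 * 3600  # 08:00-18:00, in seconds
interval_s = 300

# 20,000 checks whose last run was nicely spread over the final 5 minutes
# of the period.
last_runs = [period_end - interval_s + (i % interval_s) for i in range(20000)]
scheduled = Counter(
    next_check_time(t, interval_s, period_start, period_end,
                    period_start + DAY)
    for t in last_runs
)
print(scheduled)  # a single start time carrying all 20000 checks
```

However evenly the checks were distributed before the period ended,
that distribution is lost at the boundary.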

Kind regards,
								J. Bern
-- 
Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SchedGraph.2010-11-23-23:55.png
Type: image/png
Size: 35519 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20101124/b20250c2/attachment.png>