alternative scheduler

Max Schubert maxs at webwizarddesign.com
Thu Dec 2 15:08:44 CET 2010


The problem with any smoothing or readjustment of check intervals
arises when performance metrics are being collected along with state -
not having a stable interval between checks throws off the spacing
between data points in metric databases.

Some amount of jitter in intervals can be accounted for when
inserting data points into metric databases with some fairly simple
math (truncating timestamps to the nearest minute, for example), but
if intervals are not reasonably accurate then using metrics over time
for trending and comparison gets much trickier and requires a lot of
mathematical adjustment at view time if we are, say, looking at trend
lines for 10 or 20 elements at once. That scales very poorly when we
want to view hundreds or thousands of metric lines at once - even if
they are aggregated first (which is usually done in some fashion with
huge numbers of metrics).

We have mitigated this issue a bit by adding truncation code before
inserting metrics into our long-term trending data warehouse - that
means that what goes in falls on even minute intervals, making
graphing a cheap operation even over many data points.
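
The truncation itself is cheap; a minimal sketch in C (the function
name is mine, not our actual warehouse loader):

    #include <time.h>

    /* Round a sample timestamp to the nearest whole minute so that
     * consecutive data points land on even 60-second boundaries;
     * this absorbs up to 30 seconds of jitter in either direction. */
    time_t round_to_minute(time_t sample)
    {
        return ((sample + 30) / 60) * 60;
    }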

Our longer-term resolution will be to decouple fault management
tests from metrics collection, as the metrics force us to watch
service latency and check intervals very closely for SNMP delta
metric collection - it is a PITA.  We plan on having an agent on
every system that focuses on streaming metrics to collectors, thereby
freeing the polling-based tests from being locked into very accurate
check intervals.
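
As a rough illustration of the agent side - the datagram format, the
function, and the choice of plain UDP are all a sketch of mine, not
our actual design:

    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    /* Fire-and-forget one "name value timestamp" datagram at a
     * collector, independent of any polling schedule. */
    int stream_metric(const char *collector_ip, int port,
                      const char *name, double value)
    {
        char buf[256];
        struct sockaddr_in dst;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0)
            return -1;
        memset(&dst, 0, sizeof(dst));
        dst.sin_family = AF_INET;
        dst.sin_port = htons(port);
        inet_pton(AF_INET, collector_ip, &dst.sin_addr);
        snprintf(buf, sizeof(buf), "%s %f %ld",
                 name, value, (long)time(NULL));
        sendto(fd, buf, strlen(buf), 0,
               (struct sockaddr *)&dst, sizeof(dst));
        close(fd);
        return 0;
    }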

Max

On 12/2/10, Andreas Ericsson <ae at op5.se> wrote:
> On 12/02/2010 12:36 PM, Jochen Bern wrote:
>> On 12/02/2010 10:46 AM, Andreas Ericsson wrote:
>>> On 12/02/2010 10:03 AM, Jochen Bern wrote:
>>>> Unless I *really* need new glasses, there are only three different
>>>> kinds of such rescheduling code in the 3.2.x Nagios core:
>>>> 1. Reschedule *exactly* check_interval / retry_interval from last due
>>>> time (iff check_period allows this) - e.g., base/checks.c::1301ff :
>>> This could trivially be changed by the simple expedient of scheduling the
>>> checks with a random component and offsetting the check backwards in time
>>> by half the random flex component.
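
If I read that right, the tweak amounts to something like this (a
sketch with made-up names; flex is the width of the random component
in seconds, so flex = 30 gives the -15..+15 spread suggested below):

    #include <stdlib.h>
    #include <time.h>

    /* Next check lands at the nominal due time plus a zero-mean
     * random offset in roughly -flex/2..+flex/2, so the jitter
     * spreads checks out without drifting the average interval. */
    time_t next_check_time(time_t last_due, int interval, int flex)
    {
        int jitter = (rand() % (flex + 1)) - flex / 2;
        return last_due + interval + jitter;
    }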
>>
>> (Which is what I've hacked into the core right now - as I mentioned, a
>> random offset of -7..0 seconds, typically every check_interval = 5
>> minutes, takes ~6h to undo the peak-building of the nightly logfile
>> rotation.)
>>
>
> If you use -15..+15 seconds it will spread a lot faster.
>
>>>> 2. Reschedule to the *very first second* permitted by check_period -
>>>> e.g., base/checks.c::278ff :
>>> Here we could do a similar tweak, adding a random number between 0 and 60
>>> to the scheduler. It wouldn't be perfect, but it would be better than the
>>> current scheme, and with a half-decent PRNG it would mean checks would
>>> stay smoothed out for the duration of Nagios' lifespan.
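
Sketched out, with a helper name of my own invention:

    #include <stdlib.h>
    #include <time.h>

    /* Spread checks across the first minute of the valid timeframe
     * instead of piling them all onto its very first second. */
    time_t first_check_in_period(time_t period_start)
    {
        return period_start + (rand() % 60);
    }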
>>
>> Where "smoothed out" is defined as "randomly distributed in the first
>> minute of a valid timeframe, spreading further due to check_interval
>> randomization for as long as the timeframe runs, and losing all the
>> latter randomization as they skip over the next *in*valid timeframe".
>>
>
> The "losing all the randomization" won't be necessary if the checks
> were to be stepped by whatever recheck interval we're currently using
> instead of set fixedly to the first second of the next valid timeframe.
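
In other words, something like the below; in_period() is a stand-in
of mine for the real timeperiod lookup, not the actual core function:

    #include <time.h>

    /* Stand-in stub for the real check_period lookup. */
    static int in_period(time_t t) { (void)t; return 1; }

    /* Step across invalid timeframes by the normal interval instead
     * of snapping to their first valid second, so the check keeps
     * whatever random phase it already had. */
    time_t step_into_period(time_t next_check, int interval)
    {
        while (!in_period(next_check))
            next_check += interval;
        return next_check;
    }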
>
>>>> Case 2: *Increase* next_check so as to stay within the check_period, but
>>>> determining a max increment which simultaneously smoothes out the
>>>> (potentially MANY) affected checks and avoids pushing the chain of
>>>> subsequent processing (retry_interval / max_check_attempts if found
>>>> non-OK, running event handlers, ...) *beyond* the valid timeframe is
>>>> definitely nontrivial.
>>> Not really.
>>
>> Let me play devil's advocate for a second and sketch my (so far)
>> worst-case thought scenario:
>>
>> 1. A *very* expensive check which should be done only once per day
>> during a low-load period, as long as the result is OK.
>> -->  check_period approximately == low-load period, check_interval larger
>> than the length of the check_period's "valid" timeframe.
>>
>> 2. In cases where the test returns non-OK, a certain (low) number of
>> rechecks shall be done to guard against secondary influences (say, temp
>> LAN hiccups).
>> -->  max_check_retries and retry_interval such that their product is
>> still reasonably lower than the length of the "valid" timeframe.
>>
>> 3. As soon as the service turns HARD non-OK (rather random choice, the
>> formulae would change if we'd instead use the last SOFT non-OK result,
>> but the problem stays pretty much the same), an event handler triggers
>> some corrective action (try to fix the problem within the low-load
>> period). This action needs some time to complete - let's assume it
>> doesn't agree well with the retry_interval. Once it's completed, we want
>> a last-ditch check.
>> Since we already set "too high" a check_period in step 1, we need the
>> event handler to trigger the action, make an educated guess whether it
>> might succeed, and if yes, schedule the last-ditch check through the
>> external command interface (to be executed X seconds later).
>>
>> 4. Now let's do the math: In order to make sure that the last-ditch
>> check will still fall into the check_period, and not taking any
>> retry_interval randomization into account, we need the *first* check to
>> get scheduled between period_begin and
>> 	period_end - (max_check_retries-1)*retry_interval - X
>> 		- [some time for event handler latency&exec]
>> where X is a substantial delay programmed into the event handler,
>> nowhere to be found in the data available to Nagios itself.
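
To put arbitrary numbers on that: with a 01:00-05:00 timeframe,
max_check_retries = 3, retry_interval = 10 minutes and X = 30 minutes,
the first check has to land between 01:00 and
05:00 - 2*10min - 30min = 04:10, minus whatever the event handler's
own latency and execution time eat - and that X lives only inside the
event handler, invisible to Nagios.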
>>
>
> Or we can just inform users that the period in which they want such
> very specialized checks to run should be longer than the desired
> check_interval + (retry_interval * (max_check_attempts + 1)) to get
> something up and going quickly.
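
For instance (numbers mine): with check_interval = 60 minutes,
retry_interval = 10 minutes and max_check_attempts = 3, that rule asks
for a timeframe longer than 60 + 10*(3+1) = 100 minutes.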
>
> As an aside, the proper way to smooth out load would be to assign to
> each check a "load-score", which gets sampled every so often along
> with the checks that ran in the past sample frame. Each scheduling
> queue slot should get a pre-defined maximum load. This would let
> hundreds of low-load checks run at the same time, while heavy checks
> would be run almost in serial. The load-score should probably be set
> automatically and at least resemble online_cpus * 2 or something.
>
> The code to make that happen wouldn't be exactly trivial though, and
> cheap checks that are run in parallel with heavy ones will get unfairly
> penalized by this system. That shouldn't matter much though, as they'll
> quickly be separated so that checks with a high load-score aren't run
> at the same time, and then the values for the lower-load plugins will
> auto-adjust over time.
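
A toy sketch of that idea - the names, numbers, and greedy
slot-filling are all mine, not a proposal for the actual structures:

    #include <stdio.h>

    #define NSLOTS      60   /* one-second scheduling slots */
    #define SLOT_BUDGET  8   /* per-slot load cap, e.g. ~online_cpus * 2 */

    static int slot_load[NSLOTS];

    /* Place a check in the first slot at or after its due slot that
     * still has budget left: cheap checks pack together while heavy
     * ones end up spread out almost serially. */
    int schedule_check(int due_slot, int load_score)
    {
        int s;
        for (s = due_slot; s < NSLOTS; s++) {
            if (slot_load[s] + load_score <= SLOT_BUDGET) {
                slot_load[s] += load_score;
                return s;
            }
        }
        return -1;   /* no room left in this window */
    }

    int main(void)
    {
        int i;
        /* ten cheap checks and two heavy ones, all due at slot 0 */
        for (i = 0; i < 10; i++)
            printf("cheap check -> slot %d\n", schedule_check(0, 1));
        printf("heavy check -> slot %d\n", schedule_check(0, 6));
        printf("heavy check -> slot %d\n", schedule_check(0, 6));
        return 0;
    }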
>
> Or we could let users assign certain commands a "heavy-load" flag,
> which could let Nagios schedule only a few such checks to run in
> parallel. check_esx3 comes to mind as a suitable candidate for such
> an option.
>
> --
> Andreas Ericsson                   andreas.ericsson at op5.se
> OP5 AB                             www.op5.se
> Tel: +46 8-230225                  Fax: +46 8-230231
>
> Considering the successes of the wars on alcohol, poverty, drugs and
> terror, I think we should give some serious thought to declaring war
> on peace.
>
