check scheduling when checks are inhibited.

Andreas Ericsson ae at op5.se
Tue Nov 23 09:50:13 CET 2010


On 11/22/2010 10:41 PM, Paul M. Dubuc wrote:
> We're using Nagios 3.2.3 for simulation of monitoring load in a load test
> environment as well as for monitoring production services.  I've noticed some
> interesting behavior in the way Nagios schedules checks when checks are
> inhibited either through the CGI Process Commands or by setting a check_period
> timeperiod that inhibits checks during regularly scheduled down times.
> 
> Normally Nagios seems to spread out host and service checks evenly over time
> but when checks are stopped with the Process Command, Nagios seems to
> reschedule checks so that they are "bunched up" much closer together.  This
> creates alternating periods of densely scheduled and more sparsely scheduled
> checks that seem to persist when checks are turned on again.  It has a
> noticeable effect in our load testing.  The only way--or the quickest way--to
> get Nagios to smooth out the schedule again is to stop the process completely
> until all the scheduled check times have passed.
> 
> In testing Nagios monitoring of our production services, if I use the
> check_period to inhibit checks during our down times, I notice that as the
> downtime approaches, ALL checks are rescheduled for the exact time that the
> downtime ends (according to the check_period).  This creates a big spike in
> monitoring activity after the downtime.  One way to avoid this, I think, is to
> let checks run during the down times but inhibit notifications instead by
> using the timeperiod to define a notification_period.  But I wonder if this
> "bunching" up of the schedule when using check_periods is ever a desirable
> behavior.
> 

I have some plans to make Nagios spread the checks with a randomized interleave
factor so that a check scheduled to run once every 5 minutes can be run anywhere
between 4m 30s and 5m 0s after it last ran. The 30 second random-spread would be
the default and it would otherwise be configurable.
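
To make the idea concrete, here is a rough sketch in Python of what such a
randomized spread could look like. Nothing here is Nagios code: the function
name next_check_time and its parameters are made up for illustration, and the
only things taken from the plan above are the 5 minute interval and the
30 second default spread.

```python
import random


def next_check_time(last_run, interval=300.0, spread=30.0, rng=random):
    """Return the next run time for a check (hypothetical helper).

    The check nominally runs every `interval` seconds, but each run is
    pulled earlier by a random amount in [0, spread] seconds, so a
    5 minute check lands between 4m 30s and 5m 0s after its last run.
    """
    if not 0 <= spread <= interval:
        raise ValueError("spread must be between 0 and interval")
    return last_run + interval - rng.uniform(0.0, spread)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs look the same
    t = 0.0
    for _ in range(5):
        nxt = next_check_time(t, rng=rng)
        print(f"ran at {t:8.1f}s, next run {nxt - t:5.1f}s later")
        t = nxt
```

Because every check drifts by a different random amount on each run, checks
that start out scheduled for the same second should gradually move apart
instead of staying bunched.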

Another thing worth looking into is making services on the same host not run
simultaneously. If the checked server is expected to be heavily loaded, it
may not play nicely with 30-40 checks fired at it at once.
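
One possible way to do that, sketched in Python under my own assumptions (the
stagger_by_host name, the tuple layout and the min_gap parameter are all
hypothetical, and this is not how Nagios schedules today), is to push each
check back until the previous check for the same host is at least a minimum
gap behind it:

```python
from collections import defaultdict


def stagger_by_host(checks, min_gap=1.0):
    """Space out checks that target the same host (illustrative only).

    `checks` is a list of (host, service, scheduled_time) tuples.
    Returns a new list in which two checks for the same host are at
    least `min_gap` seconds apart. A check is only ever delayed, never
    moved earlier, and checks for different hosts do not affect each
    other.
    """
    next_free = defaultdict(lambda: float("-inf"))  # per-host earliest slot
    result = []
    for host, service, when in sorted(checks, key=lambda c: c[2]):
        start = max(when, next_free[host])
        result.append((host, service, start))
        next_free[host] = start + min_gap
    return result


if __name__ == "__main__":
    queue = [
        ("web1", "http", 0.0),
        ("web1", "ssh", 0.0),
        ("db1", "mysql", 0.0),
        ("web1", "disk", 0.5),
    ]
    for host, service, start in stagger_by_host(queue, min_gap=2.0):
        print(f"{host:5s} {service:6s} starts at {start:4.1f}s")
```

The trade-off is that a host with many services gets its checks stretched
over a longer window, so the gap has to stay small compared to the check
interval.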

You really should be using scheduled downtime for regular downtime though. There
are pre-hacked solutions to automagically reschedule recurring downtime. Ninja
supports it out of the box as of the latest version (or possibly latest git).

> These aren't critical issues for us since we can work around them
> procedurally.

That's good to hear.

>  But I wonder if there is a way to prevent the scheduled checks
> from getting bunched together like this if/when you need to inhibit checks for
> a time while keeping Nagios running. Maybe the auto_rescheduling options in
> the nagios.cfg are meant to address this, but they have a potentially negative
> effect on performance according to the comments around them in the file.
> 

The below text is what I'd call "educated speculation" after having thrown
a quick glance at the code. I might be completely wrong, but I don't think
so.

Not potentially; they do have a negative side effect. This is because they
keep the scheduling intervals between checks stable over time by adding the
checks to the scheduling queue whenever they're supposed to run, without
actually executing them. So if you've scheduled downtime for 4 hours and
have a default check interval of 5 minutes, auto_rescheduling will schedule
the check every 30 seconds (default) that entire time, but not actually run
the check command unless it's time to do so.

On the one hand, it shouldn't actually cause any major problems since it'll
still do less than it would do were the checks enabled. On the other hand,
it should be solvable without such hackery, but with the downside that
a check executed 3 minutes before downtime started may not be executed again
until a few minutes after downtime ends. That's how the auto_reschedule
option works too though, if I'm reading the code correctly.
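
A sketch of what "without such hackery" could mean, in Python and purely as
an illustration (next_run_after_inhibit is a name I made up, and this is not
taken from the Nagios source): instead of moving every check to the moment
the inhibited period ends, step each check forward in whole intervals from
its last run until it lands at or after that moment. Checks that were spread
out before the downtime then stay spread out after it.

```python
import math


def next_run_after_inhibit(last_run, interval, inhibit_end):
    """Next run time that keeps the check on its original interval grid.

    Returns the smallest last_run + k * interval (k >= 1) that is not
    before `inhibit_end`. All times are in seconds. Hypothetical helper.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    k = max(1, math.ceil((inhibit_end - last_run) / interval))
    return last_run + k * interval


if __name__ == "__main__":
    interval = 300        # 5 minute check interval
    downtime_end = 14400  # downtime runs from t=0 for 4 hours
    # Checks last executed 3, 2 and 1 minutes before downtime started.
    for last_run in (-180, -120, -60):
        nxt = next_run_after_inhibit(last_run, interval, downtime_end)
        print(f"last run {last_run:5d}s -> next run {nxt}s "
              f"({nxt - downtime_end}s after downtime ends)")
```

With these numbers the check that ran 3 minutes before downtime comes back
2 minutes after downtime ends, and the three checks keep their original
60 second spacing rather than all firing at the same instant.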

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
