Scheduled downtime and host checks

Paul M. Dubuc work at paul.dubuc.org
Wed Jun 1 17:00:34 CEST 2011


Jeffrey Watts wrote:
> On Wed, Jun 1, 2011 at 1:27 AM, Kumar, Ashish <xml.devel at gmail.com
> <mailto:xml.devel at gmail.com>> wrote:
>
>
>         No, scheduled downtime only affects notifications, and the stats you
>         see in the availability cgi.  Service and host checks run as normal
>         during scheduled downtime.
>
>
>     Thanks Jim for the explanation but I do not see any rational reason
>     to execute host and service checks while the monitored host is
>     scheduled for "fixed" downtime.
>
>
> There are plenty of rational reasons.  Just because you disagree with
> the default behavior doesn't mean it's irrational.  Many, many, many
> times I put systems into scheduled, fixed downtime and still want checks
> to be executed.  For example, if I know the netadmins are going to be
> reconfiguring networking at one of our datacenters I will schedule fixed
> downtime for the period of their maintenance for the
> servers/switches/routers affected.
>
> However, I do want to see what's up and down during that time so I can
> tell when they start and finish their work, and what they're affecting.
>   That's a perfectly rational reason to do checks during maintenance.
>
>         This is useful because it allows you to
>         check the stats of those hosts and services are ok before the
>         scheduled downtime period ends.
>
>
>     But if the host/services are offline after the scheduled "fixed"
>     downtime period ends it will send the notifications anyway (or would
>     it not?)
>
>     I wish there was a way to disable active checks while a host has
>     scheduled downtime set.
>
>
> If the hosts and services are down after the downtime ends yes it will
> send notifications, as clearly either:
>
> 1) The maintenance window wasn't long enough.
> 2) Someone broke something, or something died for another reason during
> maintenance
>
> Sounds like proper behavior.
>
> As far as your question goes, you can disable active checks manually, or
> you can write a script that sets downtime and disables active checks at
> the same time.  You could then run it (manually or via 'at' or something
> else) to re-enable active checks.  Or hack the Nagios source code and
> add that option yourself.  I believe in the last week or so someone
> posted a sample script for setting downtime via a script, so you might
> search the archives.
>
> Jeffrey.

You give some very good reasons for Nagios's current behavior during a downtime.
But I agree with the original request that there be an option to disable
checks during a downtime, because there are equally rational reasons to do so.

There are some cases where we really should not be running service checks
during downtimes because of the extra load they put on our system when they
fail.  Many of our checks fail in this case by timing out, and they use
relatively scarce (shared) and resource-intensive processes (web browser
sessions run under SeleniumRC).  Timeouts tend to be long for these checks, so
there is more contention for these processes when all the checks using them
start failing, and the checks are run more often until they all go into a
'hard' failure state.  Maybe we can live with this, but it would be easier on
the system to just inhibit checks we know are going to fail during certain
regularly scheduled downtimes.  There may be plenty of other examples where
running lots of failing checks during a downtime ends up using significant
system resources.

We implement our regular downtimes by defining the uptime with a
timeperiod and using that for the check_period and notification_period of our
services.  The problem with that is that all the services get scheduled to run
at the exact second that our "downtime" ends.  So we have to define a
concurrency limit and rely on Nagios nudging checks out when the limit is
reached in order to spread the schedule out again.

It would be very nice to be able to define regular downtimes with timeperiods 
and have the option of inhibiting checks as well as notifications during those 
downtimes without bunching up the scheduling queue when the downtime ends.
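
Until such an option exists, the workaround Jeffrey describes (a script that
sets downtime and disables active checks at the same time, plus a second run
to re-enable them) might look roughly like the sketch below.  It only builds
and writes Nagios external command lines.  The function names, the host name
"web01", and the command file path mentioned afterwards are illustrative
assumptions, not anything taken from this thread, and the sketch does not
address the timeperiod-based regular downtimes described above.

```python
import time


def downtime_commands(host, start, end, author, comment, now=None):
    """Build external-command lines that put `host` and its services into
    fixed downtime and disable their active checks.

    `start` and `end` are Unix timestamps. The helper name and argument
    order are hypothetical; only the command names are Nagios's own.
    """
    now = int(time.time()) if now is None else int(now)
    start, end = int(start), int(end)
    duration = end - start  # ignored by Nagios for fixed downtime, but required
    args = f"{host};{start};{end};1;0;{duration};{author};{comment}"
    return [
        f"[{now}] SCHEDULE_HOST_DOWNTIME;{args}",
        f"[{now}] SCHEDULE_HOST_SVC_DOWNTIME;{args}",
        f"[{now}] DISABLE_HOST_CHECK;{host}",
        f"[{now}] DISABLE_HOST_SVC_CHECKS;{host}",
    ]


def reenable_commands(host, now=None):
    """Build the lines that turn active checks back on after the downtime."""
    now = int(time.time()) if now is None else int(now)
    return [
        f"[{now}] ENABLE_HOST_CHECK;{host}",
        f"[{now}] ENABLE_HOST_SVC_CHECKS;{host}",
    ]


def submit(lines, command_file):
    """Write command lines to the Nagios external command file (a named pipe).

    The path is site-specific and must be supplied by the caller.
    """
    with open(command_file, "w") as pipe:
        for line in lines:
            pipe.write(line + "\n")


if __name__ == "__main__":
    # Print the commands for a one-hour window instead of submitting them.
    for line in downtime_commands("web01", 1000, 4600, "admin", "maintenance", now=900):
        print(line)
```

To use it, one would call submit() with the path set by command_file in
nagios.cfg (often something like /usr/local/nagios/var/rw/nagios.cmd, but
that depends on the installation), then schedule a second run with
reenable_commands() via 'at' or cron for the end of the window.  Note that
re-enabling checks this way does nothing by itself to spread the checks out
again, so the bunching problem may still need the concurrency limit.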
