How to explain active host checks to boss

mark.potter at academy.com mark.potter at academy.com
Wed Feb 13 19:52:56 CET 2008


nagios-users-bounces at lists.sourceforge.net wrote on 02/13/2008 11:29:06 AM:

> 
> 
> > -----Original Message-----
> > From: nagios-users-bounces at lists.sourceforge.net [mailto:nagios-users-
> > bounces at lists.sourceforge.net] On Behalf Of mark.potter at academy.com
> > Sent: Wednesday, February 13, 2008 10:48 AM
> > To: nagios-users at lists.sourceforge.net
> > Subject: [Nagios-users] How to explain active host checks to boss
> > 
> > 
> > Background: Due to management requirements we are using NagiosQL as a
> > configuration manager for our Nagios install. NagiosQL defaults to
> > active checks enabled for hosts so this is how it's been done until
> > now. We have the alerts coming as we want them. We are adding more
> > hosts and services weekly. I know that active host checks are not a
> > good thing to have going forward as they are unnecessary. Please advise
> > on the best way to explain this to the boss who is, at this moment,
> > convinced that if we turn off the option in the config file then the
> > host will never be checked even if a service is down. I can't find a
> > good place in the documentation to point this out and would like to get
> > these turned off in the near future so we don't run into issues later
> > on down the road. Any help in pointing me in the right direction would
> > be appreciated. Here is a sample host cfg from our environment:
> 
> Assuming you're using 2.x: the main issue with host checks in 2.x and
> prior is that they are performed serially, not in parallel. While a host
> check is being run, nagios stops absolutely everything else (other
> host/service checks, notifications, etc.) until that single host check is
> complete. To put this in perspective, assume that you have 100 hosts
> checked with 10 pings over a 15 minute check_interval with a
> max_check_attempts of 3. When every host is up, each host check will
> take approximately 10 seconds to complete, during which nagios isn't
> doing anything else except obsessing over that host --
> 
> 100 hosts X 10 seconds = 1000 seconds 
> 
> As you can see, you've already exceeded your normal check interval of
> 900 seconds. Nagios cannot complete the host checks in the time interval
> you've specified and you haven't even done any service checks yet. Now,
> nagios will attempt to interleave service checks between host checks to
> compensate but you've just introduced latency for both check types.
> 
> Now imagine that you have a simple outage. 5 hosts are down that aren't
> related via parenting. Your timing now looks like --
> 
> (95 hosts X 10 seconds) + (5 hosts X 30 seconds) = 1100 seconds,
> dedicated to host checks only.
> 
> Because the host checks aren't related, nagios is able to interleave
> some service checks between so the latency isn't as bad as it could be.
> Take the calculation above and work out the effects of a large outage.
> Factor in parenting, where nagios will only be checking hosts up the
> tree without interleaving service checks, and you start seeing big
> problems at the time that your monitoring system is most critical and
> useful. You could easily end up in a situation where hosts and services
> aren't being checked for loooooooong intervals.
> 
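The timing arithmetic above can be reproduced with a short Python sketch. This is only a back-of-the-envelope model, not anything Nagios itself runs: the function name and parameters are hypothetical, and it assumes (as the figures above do) that each attempt takes about 10 seconds and that a down host uses all 3 attempts.

```python
# Rough model of serial host check time in Nagios 2.x.
# Assumption: every attempt costs the same time, and a down host
# burns through all max_check_attempts before it is declared down.

CHECK_INTERVAL_SECONDS = 15 * 60  # the 15 minute check_interval


def host_check_seconds(hosts_up, hosts_down,
                       seconds_per_attempt=10, max_check_attempts=3):
    """Total seconds spent on host checks in one pass over all hosts."""
    up_time = hosts_up * seconds_per_attempt
    down_time = hosts_down * seconds_per_attempt * max_check_attempts
    return up_time + down_time


all_up = host_check_seconds(hosts_up=100, hosts_down=0)
small_outage = host_check_seconds(hosts_up=95, hosts_down=5)

print(all_up, small_outage, CHECK_INTERVAL_SECONDS)
```

Plugging in a larger `hosts_down` value shows how quickly the total moves past the check interval during a big outage.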
> 
> Nagios is smart. You don't need to schedule regular host checks because
> nagios knows that if there is a problem with a service, it may be caused
> by an outage of the host or a parent of the host. Nagios will
> automagically run the host check_command anytime there is a non-OK
> result from a service check, assuming only that active_checks_enabled is
> on for the host and there is a valid check_command specified. It will
> also follow the parents tree if the host check returns non-OK results
> until nagios finds an OK parent or reaches the top of the tree. Even so,
> you want your host checks to finish as quickly as possible; 1 ping per
> attempt with max_check_attempts set to 3 is usually sufficient to
> determine status.
> 
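The on-demand behaviour described above can be sketched roughly in Python. This is a simplified illustration and not the actual Nagios logic: the function, the host names, and the single-parent dictionary are all hypothetical (real Nagios hosts can have several parents), and retries are left out.

```python
# Simplified model of on-demand host checks after a non-OK service result.
# Assumption: each host has at most one parent; None marks the top of the tree.

def on_demand_host_checks(host, parents, is_up):
    """Return the hosts that get checked, in order, after a service on
    `host` returns non-OK. Walks up the parent chain until a host is
    found to be up or the top of the tree is reached."""
    checked = []
    current = host
    while current is not None:
        checked.append(current)
        if is_up(current):
            break  # found an OK host, stop climbing
        current = parents.get(current)
    return checked


# Hypothetical topology: web01 -> switch01 -> router01
parents = {"web01": "switch01", "switch01": "router01", "router01": None}
down = {"web01", "switch01"}

outage_path = on_demand_host_checks("web01", parents, lambda h: h not in down)
healthy_path = on_demand_host_checks("web01", parents, lambda h: True)

print(outage_path)
print(healthy_path)
```

When the host itself answers, only one host check runs and the service alert goes out; the climb up the tree only happens when the host check also fails.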
> Nagios-3 introduces parallel host check execution, and there are some
> benefits to running scheduled host checks there, specifically caching
> results for possible use by the on-demand checks, or collecting host
> performance data for trending, but they still aren't necessary.
> 
> Some documentation to help --
> 
> http://nagios.sourceforge.net/docs/2_0/xodtemplate.html#host
> 
> "check_interval:    NOTE: Do NOT enable regularly scheduled checks
> of a host unless you absolutely need to! Host checks are already
> performed on-demand when necessary, so there are few times when
> regularly scheduled checks would be needed. Regularly scheduled host
> checks can negatively impact performance - see the performance tuning
> tips for more information. This directive is used to define the number
> of "time units" between regularly scheduled checks of the host. Unless
> you've changed the interval_length directive from the default value of
> 60, this number will mean minutes. More information on this value can be
> found in the check scheduling documentation."
> 
> 
> http://nagios.sourceforge.net/docs/2_0/networkreachability.html
> 
> "The main purpose of Nagios is to monitor services that run on or are
> provided by physical hosts or devices on your network. It should be
> obvious that if a host or device on your network goes down, all services
> that it offers will also go down with it. Similarly, if a host becomes
> unreachable, Nagios will not be able to monitor the services associated
> with that host.
> 
> Nagios recognizes this fact and attempts to check for such a scenario
> when there are problems with a service. Whenever a service check results
> in a non-OK status level, Nagios will attempt to check and see if the
> host that the service is running on is "alive". Typically this is done
> by pinging the host and seeing if any response is received. If the host
> check command returns a non-OK state, Nagios assumes that there is a
> problem with the host. In this situation Nagios will "silence" all
> potential alerts for services running on the host and just notify the
> appropriate contacts that the host is down or unreachable. If the host
> check command returns an OK state, Nagios will recognize that the host
> is alive and will send out an alert for the service that is
> misbehaving."
> 
> --
> Marc
> 
That is precisely the sort of explanation I needed. I think I have the
convincing done, but there seems to be some concern about how this will
show up on tac.cgi and in other places. It will show as disabled, if I am
not mistaken. I think management may be concerned about this for reasons
only management understands. Under the Hosts bar in tac.cgi it will show
all 309 hosts as being disabled, correct? Since the documentation
recommends disabling active host checks for obvious reasons, why is this
shown on tac.cgi under Hosts and again under Active Checks (in red, no
less)? I almost wish I understood management at this point...