How to explain active host checks to boss

Marc Powell marc at ena.com
Wed Feb 13 18:29:06 CET 2008

> -----Original Message-----
> From: nagios-users-bounces at lists.sourceforge.net [mailto:nagios-users-
> bounces at lists.sourceforge.net] On Behalf Of mark.potter at academy.com
> Sent: Wednesday, February 13, 2008 10:48 AM
> To: nagios-users at lists.sourceforge.net
> Subject: [Nagios-users] How to explain active host checks to boss
> 
> 
> Background: Due to management requirements we are using NagiosQL as a
> configuration manager for our Nagios install. NagiosQL defaults to
> active checks enabled for hosts so this is how it's been done until
> now. We have the alerts coming as we want them. We are adding more
> hosts and services weekly. I know that active host checks are not a
> good thing to have going forward as they are unnecessary. Please
> advise on the best way to explain this to the boss who is, at this
> moment, convinced that if we turn off the option in the config file
> then the host will never be checked even if a service is down. I can't
> find a good place in the documentation to point this out and would
> like to get these turned off in the near future so we don't run into
> issues later on down the road. Any help in pointing me in the right
> direction would be appreciated. Here is a sample host cfg from our
> environment:

I'll assume you're using 2.x. The main issue with host checks in 2.x and
earlier is that they are performed serially, not in parallel. While a
host check is running, nagios stops absolutely everything else (other
host/service checks, notifications, etc.) until that single host check
is complete. To put this in perspective, assume that you have 100 hosts,
each checked with 10 pings, on a 15 minute check_interval with a
max_check_attempts of 3. When every host is up, each host check will
take approximately 10 seconds to complete, during which nagios isn't
doing anything except obsessing over that host --

100 hosts X 10 seconds = 1000 seconds 

As you can see, you've already exceeded your normal check interval of
900 seconds. Nagios cannot complete the host checks in the time interval
you've specified and you haven't even done any service checks yet. Now,
nagios will attempt to interleave service checks between host checks to
compensate but you've just introduced latency for both check types.
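
The arithmetic above can be reproduced with a short Python sketch. The
function and variable names are illustrative, not Nagios settings, and it
assumes (as in the example) roughly 10 seconds per host check and fully
serial execution.

```python
def serial_host_check_seconds(num_hosts, seconds_per_check):
    """Total wall-clock time if every host check runs one after another."""
    return num_hosts * seconds_per_check


# 15 minute check_interval with the default interval_length of 60 seconds
check_interval_seconds = 15 * 60

total = serial_host_check_seconds(100, 10)
print(total, "seconds of host checks vs", check_interval_seconds, "available")
print("over budget by", total - check_interval_seconds, "seconds")
```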

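Now imagine that you have a simple outage. 5 hosts are down that aren't
related via parenting. Each down host is retried up to
max_check_attempts times, so it takes 3 X 10 = 30 seconds instead of 10.
Your timing now looks like --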

(95 hosts X 10 seconds) + (5 hosts X 30 seconds) = 1100 seconds,
dedicated to host checks only.

Because the host checks aren't related, nagios is able to interleave
some service checks between them, so the latency isn't as bad as it
could be. Take the calculation above and determine the effects of a
large outage. Factor in parenting, where nagios will only be checking
hosts up the tree without interleaving service checks, and you start
seeing big problems at exactly the time that your monitoring system is
most critical and useful. You could easily end up in a situation where
hosts and services aren't being checked for loooooooong intervals.
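
To get a feel for how this scales, here is a rough Python sketch of the
same calculation for larger outages. It is a simplified model with
made-up function and parameter names: it assumes 10 seconds per attempt,
that a down host uses all 3 attempts, and it ignores parenting and
interleaved service checks entirely.

```python
def host_check_seconds(hosts_up, hosts_down,
                       seconds_per_attempt=10, max_check_attempts=3):
    """Serial host check time: up hosts take one attempt, down hosts all of them."""
    up_time = hosts_up * seconds_per_attempt
    down_time = hosts_down * seconds_per_attempt * max_check_attempts
    return up_time + down_time


# 100 hosts total, with a growing number of them down
for down in (0, 5, 25, 50):
    print(down, "down:", host_check_seconds(100 - down, down), "seconds")
```

With 5 hosts down this gives the 1100 seconds from the example; with half
the hosts down the model reaches 2000 seconds, more than double the 900
second check interval.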


Nagios is smart. You don't need to schedule regular host checks because
nagios knows that if there is a problem with a service, it may be caused
by an outage of the host or a parent of the host. Nagios will
automagically run the host check_command anytime there is a non-OK
result from a service check, assuming only that active_checks_enabled is
on for the host and there is a valid check_command specified. It will
also follow the parents tree if the host check returns non-OK results
until nagios finds an OK parent or reaches the top of the tree. Even so,
you want to have your host checks finish as quickly as possible; 1 ping
with a max_check_attempts of 3 is usually sufficient to determine status.
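
The on-demand behaviour described above can be sketched in Python. This
is only a toy model for explaining the idea, not how Nagios is
implemented: the function name, the parents dictionary and the
run_host_check callback are all hypothetical, and it assumes every host
has active checks enabled and a valid check_command.

```python
OK = 0
CRITICAL = 2


def on_demand_host_checks(host, parents, run_host_check):
    """Return the hosts checked after a non-OK service result on `host`.

    parents: dict mapping a host name to a list of its parent host names
    run_host_check: callable taking a host name and returning a status code
    """
    checked = []
    seen = set()
    queue = [host]
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        checked.append(current)
        if run_host_check(current) == OK:
            continue  # found an OK host, stop climbing this branch
        # host is non-OK, so look at its parents next
        queue.extend(parents.get(current, []))
    return checked


# Hypothetical topology: web1 -> switch1 -> router1
parents = {"web1": ["switch1"], "switch1": ["router1"], "router1": []}
down = {"web1", "switch1"}
result = on_demand_host_checks(
    "web1", parents, lambda h: CRITICAL if h in down else OK)
print(result)  # ['web1', 'switch1', 'router1']
```

The point for the boss: the host check is triggered by the failed service
check, not by a schedule, and it stops as soon as an OK host is found.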

Nagios-3 introduces parallel host check execution. There are some
benefits to running regular host checks there, specifically caching
results for possible use by the on-demand checks, or collecting host
performance data for trending, but they still aren't necessary.

Some documentation to help --

http://nagios.sourceforge.net/docs/2_0/xodtemplate.html#host

"check_interval: NOTE: Do NOT enable regularly scheduled checks
of a host unless you absolutely need to! Host checks are already
performed on-demand when necessary, so there are few times when
regularly scheduled checks would be needed. Regularly scheduled host
checks can negatively impact performance - see the performance tuning
tips for more information. This directive is used to define the number
of "time units" between regularly scheduled checks of the host. Unless
you've changed the interval_length directive from the default value of
60, this number will mean minutes. More information on this value can be
found in the check scheduling documentation."


http://nagios.sourceforge.net/docs/2_0/networkreachability.html

"The main purpose of Nagios is to monitor services that run on or are
provided by physical hosts or devices on your network. It should be
obvious that if a host or device on your network goes down, all services
that it offers will also go down with it. Similarly, if a host becomes
unreachable, Nagios will not be able to monitor the services associated
with that host.

Nagios recognizes this fact and attempts to check for such a scenario
when there are problems with a service. Whenever a service check results
in a non-OK status level, Nagios will attempt to check and see if the
host that the service is running on is "alive". Typically this is done
by pinging the host and seeing if any response is received. If the host
check command returns a non-OK state, Nagios assumes that there is a
problem with the host. In this situation Nagios will "silence" all
potential alerts for services running on the host and just notify the
appropriate contacts that the host is down or unreachable. If the host
check command returns an OK state, Nagios will recognize that the host
is alive and will send out an alert for the service that is
misbehaving."

--
Marc

