Uptime Calculation Question

Breandan Dezendorf breandan at dezendorf.com
Fri Feb 11 14:27:53 CET 2011


On Fri, Feb 11, 2011 at 7:03 AM, Kevin Keane <subscription at kkeane.com> wrote:
> The trick is to carefully select what you are actually checking. You
> probably don't want to run 5000 checks every five minutes, but you really
> only need to have one check, or a few at most, per server that will tell you
> whether or not whatever you are monitoring is up; that should be enough
> for your SLA. Make sure that check is very inexpensive computationally,
> and you can safely run it once per minute.

On Fri, Feb 11, 2011 at 2:48 AM, Jim Avery <jim at jimavery.me.uk> wrote:
> I think the trick is only to set short check interval for those
> services where accurate stats are critical.  For example, for a web
> server, set check_interval to a short value for check_http but a long
> value for disk checks, memory checks, log file checks and so on.

I understand that decreasing the check_interval decreases the margin
of possible error.  I was hoping someone with a statistics bent could
show me that "with a check_interval of X minutes, your availability in
nines can be reported with an accuracy of up to Y percent".  I will
probably go the route of inexpensive and frequent checks for services
where we're trying to promise a specific number of nines.  That will
be a communication issue with the other administrators and the
customers, and doesn't rate discussion on this list.
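
For what it's worth, here is a rough sketch of the kind of bound being
asked about.  It is not a statistical treatment, only worst-case
arithmetic, and it assumes strictly periodic checks with no retry
logic, so that each outage is measured as a whole number of check
intervals and is therefore off by at most one interval in either
direction.  The function names and the 30-day reporting period are
made up for illustration and are not anything Nagios provides.

```python
def allowed_downtime_minutes(nines: int, period_minutes: float) -> float:
    """Downtime budget for an availability target of `nines` nines.

    For example, 3 nines (99.9%) allows 0.1% of the period as downtime.
    """
    return period_minutes * 10 ** (-nines)


def availability_error_bound(check_interval_minutes: float,
                             outages: int,
                             period_minutes: float) -> float:
    """Worst-case absolute error in measured availability, as a fraction.

    Assumption: each outage's measured duration can differ from its true
    duration by up to one check interval, so the errors add up to at most
    outages * check_interval over the reporting period.
    """
    return outages * check_interval_minutes / period_minutes


if __name__ == "__main__":
    period = 30 * 24 * 60  # hypothetical 30-day reporting period, in minutes

    # Error contributed by a single outage at two candidate intervals.
    for interval in (1, 5):
        bound = availability_error_bound(interval, outages=1,
                                         period_minutes=period)
        print(f"check_interval={interval} min: "
              f"+/- {bound * 100:.4f}% availability per outage")

    # Downtime budgets to compare those errors against.
    for nines in (2, 3, 4):
        budget = allowed_downtime_minutes(nines, period)
        print(f"{nines} nines: {budget:.2f} min of downtime allowed")
```

Comparing the two outputs gives a feel for when the interval matters:
if the per-outage error is a large fraction of the downtime budget for
the number of nines being promised, the polled figure cannot really
distinguish "met the SLA" from "missed it".  Note also that an outage
shorter than one check interval may not be seen at all.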

All of that said, I'm still unable to shake the feeling that a
poll-based system is the wrong thing to use for uptime calculations.
It doesn't help that all the alternatives that come to mind are messy
and inelegant, and would likely need to be customized to each service -
and possibly each instance of each service.  (Processing application
log files, processing system logs, doing horrible things like running
constant dtrace/strace metrics, etc.)  Clustered and load balanced
services are even harder, depending on your setup, as the hypothetical
system would have to be able to correlate logs from multiple devices.
And sadly, monitoring is only one of the things I'm responsible for.

-- 
Breandan Dezendorf
breandan at dezendorf.com
bwdezend at gmail.com

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null