Generating % error when generating reports?

Greg Vickers g.vickers at qut.edu.au
Wed May 18 01:54:13 CEST 2005


Hi all,

When thinking about SLAs and reporting, I had a thought:
When generating reports, say for an active service, the point in time 
that the service actually changed states is not the same as when Nagios 
detects that state change. Therefore there is a margin of error (fairly 
small for a state duration that is long relative to the regular 
check_interval of that service) in the reports that Nagios generates.

If there were to be a patch to allow a % margin of error to be 
calculated for a given report, would the pseudo code look something like 
this (at a high level - only accounting for HARD state changes):

(time0) found HARD service state change (e.g. at 0 sec)
... get the regular check_interval for that service (e.g. 5 min, or 300 sec)
(time1) found next HARD service state change (e.g. 100 min, or 6000 sec,
after time0)

calculate % of error in report:
  2 * check_interval / (time1 - time0) * 100
  = 2 * 300 / (6000 - 0) * 100 = 10% margin of error (ow)

The above calculation assumes the worst possible timing: a full 300 sec 
between each state change and Nagios actually detecting it. The factor 
of 2 is there because detection of the first state change may lag by up 
to 300 sec, and detection of the second by up to another 300 sec. The 
calculation does not account for a manually re-scheduled service check. 
(The responsible contact may fix the service and then schedule a check 
for right now, which would leave only a small time window.)
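
As a rough illustration of the arithmetic above, here is a minimal Python 
sketch. It is not a patch against Nagios; the function name and its 
parameters are made up for this example, and it only handles a single pair 
of HARD state changes.

```python
def worst_case_error_percent(check_interval_s, time0_s, time1_s):
    """Worst-case margin of error (%) for one state duration.

    Hypothetical helper: assumes detection of each of the two state
    changes can lag by up to one full check interval.
    """
    duration_s = time1_s - time0_s
    if duration_s <= 0:
        raise ValueError("time1 must be later than time0")
    # Two state changes, each detected up to check_interval_s late.
    return 100.0 * 2 * check_interval_s / duration_s


# The example from the pseudo code: 5 min checks, state lasting 100 min.
print(worst_case_error_percent(300, 0, 6000))
```

With the numbers used above (300 sec interval, 6000 sec duration) this 
gives the same 10% figure.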

Obviously you could reduce this % of error by reducing the check 
interval for critical services or by using passive checks. (The first 
will increase the load on the monitoring server and the monitored hosts, 
the second may not be suitable.)

Generating this % value is not terribly realistic, as the check will 
probably happen less than 300 seconds after the service changes state.
However, if this % value is available, Nagios administrators could then 
give more certainty to the PHBs about the report values (some PHBs 
actually RTFM the Nagios doco, damnit,) rather than have a PHB say 
"Service blah went down at time x, but your report shows it as down at 
time y."
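
To put a number on "probably less than 300 seconds", here is a small 
Python sketch of a typical-case figure. It rests on an assumption that is 
not anything Nagios guarantees: that a state change is equally likely to 
happen at any moment between two regular checks, so the average detection 
lag is half the check interval. The function name is made up for this 
example.

```python
def typical_error_percent(check_interval_s, duration_s):
    """Average-case margin of error (%) for one state duration.

    Assumption (not from Nagios): each detection lag is uniformly
    distributed between 0 and check_interval_s, so it averages half
    an interval.
    """
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    mean_lag_s = check_interval_s / 2.0
    # Two state changes, each detected mean_lag_s late on average.
    return 100.0 * 2 * mean_lag_s / duration_s


# 5 min checks, state lasting 100 min.
print(typical_error_percent(300, 6000))
```

Under that assumption the typical figure is half the worst case, i.e. 5% 
for a 300 sec interval and a 6000 sec state duration.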

Anyway, just an idea I had that I wanted to share with -devel, get your 
machine guns out...

-- 
Greg Vickers
Computer Systems Officer
Teaching and Learning Support Services, Systems and Architecture
Queensland University of Technology
Kelvin Grove Campus, E409
Phone: (07) 3864 8276
Mobile: 0416 001 674, Speed Dial #6 6147
Email: g.vickers at qut.edu.au
TALSS web site: http://www.talss.qut.edu.au/

CRICOS No. 00213J


