Request for comment: Overhaul of Performance Info

Sascha Runschke Sascha.Runschke at gfkl.com
Wed Apr 2 14:43:56 CEST 2008


Hi all,

I'd like to propose an overhaul of the Performance Info 
(extinfo.cgi?&type=4).

Over the last few weeks I have been preparing a migration and update from
our old 2.9 install to a new physical machine and Nagios 3.0. During that
time I watched the Performance Info a lot, since performance was an issue
for us: the "migration machine" was running inside a VM on an ESX host.
Sadly, I came to the conclusion that the info, the way it is presented,
is pretty much useless.

The reason is simple:

For example, I get the number and percentage of services that were actively
checked in the last 1/5/15/60 minutes. So far so good. But what exactly does
this info tell us?
Right: nothing. I have no means to interpret this information, because I
cannot determine whether the number of actively checked services in the last
minute (for example) is good or bad. What's missing is a number to compare
against: how many services _should_ have been actively checked in the last
minute. In our scenario, I have loads of services scheduled every minute
(pings, disk, memory, etc.), but I also have a lot of services that are only
checked once per hour or once per day.
So when Nagios tells me that 68% of my service checks were performed in the
last minute, I have no clue whether that means everything is all right or
not.

What I would like to see is a comparable performance info, telling me:

x% of your active service checks that should have been run in the last
minute have been run.
x% of your active service checks that should have been run in the last
15 minutes have been run.
etc.

That way I can decide whether I am putting too much stress on the Nagios
server or not and, if I am, whether the cause is, for example, too many
concurrent service checks that are lagging behind.
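
To make the idea concrete, here is a rough sketch in Python. This is not
Nagios code: the ScheduledCheck record, its fields and the
due_check_ratio() helper are names made up for illustration. It assumes
that for every check we know when it was due and when (if at all) it was
actually run.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScheduledCheck:
    # Hypothetical record, not a Nagios data structure.
    service: str
    scheduled_at: float           # when the check was due (epoch seconds)
    executed_at: Optional[float]  # when it actually ran, None if still pending


def due_check_ratio(checks, now, window_s):
    """Return (done, due, percent) for checks that were due in the window."""
    start = now - window_s
    # Only checks whose scheduled time falls inside the window count as due.
    due = [c for c in checks if start <= c.scheduled_at <= now]
    # Of those, count the ones that have actually been executed by now.
    done = [c for c in due if c.executed_at is not None and c.executed_at <= now]
    # With nothing due, nothing is lagging, so report 100%.
    percent = 100.0 * len(done) / len(due) if due else 100.0
    return len(done), len(due), percent


checks = [
    ScheduledCheck("ping", 950.0, 951.0),
    ScheduledCheck("disk", 960.0, 975.0),
    ScheduledCheck("memory", 990.0, None),         # due, but lagging behind
    ScheduledCheck("daily-backup", 100.0, 101.0),  # due long before the window
]
result = due_check_ratio(checks, now=1000.0, window_s=60.0)
print("%d of %d due checks were run (%.1f%%)" % result)
```

The daily check does not drag the percentage down here, because it was not
due inside the window. That is exactly the comparison the current numbers
do not allow.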

I do know that latency and execution time are displayed too, but that
information is not really useful to me either. Which brings me to the next
point:

Check Execution Time needs some means to distinguish between checks that
timed out and those that just took a long time. For as long as I can
remember, the values displayed there have looked like this:

Check Execution Time:  0.01 sec 10.01 sec 0.494 sec 

0.01 is the checks on localhost - they are the minimum
10.01 is the checks that timed out, mainly remote sites where the VPN is
currently down, for example - they are the maximum
0.5 is roughly the average at all times.

I think people wouldn't even notice if you hardcoded those numbers in the
CGI ;)
Information that is more or less static is not useful as a performance
counter. To reflect the real circumstances, timed-out checks need to be
filtered out, so that I have a means to see whether some checks take longer
than expected.
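
As a rough sketch of what I mean (again Python, not Nagios code): the
execution_time_stats() helper and the sample durations are made up for
illustration. It assumes a check counts as timed out when its execution
time reaches the timeout, which I take to be 10 seconds here, matching the
10.01 above.

```python
def execution_time_stats(durations, timeout_s):
    """Return (min, max, avg, timed_out) over checks that did not time out.

    Assumption for this sketch: a check is treated as timed out when its
    execution time is at or above timeout_s.
    """
    completed = [d for d in durations if d < timeout_s]
    timed_out = len(durations) - len(completed)
    if not completed:
        # Every check timed out (or there were none): no timing stats to show.
        return None, None, None, timed_out
    avg = sum(completed) / len(completed)
    return min(completed), max(completed), avg, timed_out


# Made-up sample: four normal checks and two that ran into the timeout.
durations = [0.01, 0.25, 0.5, 1.24, 10.01, 10.01]
stats = execution_time_stats(durations, timeout_s=10.0)
print("min=%.2f max=%.2f avg=%.3f timed_out=%d" % stats)
```

With the timeouts counted separately, the maximum becomes the slowest check
that actually completed, and a rise in that value tells me something.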

/discuss

S

-- 
Sascha Runschke
Netzwerk-  und  Systemmanagement
Telefon : +49 (201) 102-1879 Mobil : +49 (173) 5419665 Fax : +49 (201) 
102-1102105



GFKL Financial Services AG
Vorstand: Dr. Peter Jänsch (Vors.), Jürgen Baltes, Dr. Till Ergenzinger, Dr. Tom Haverkamp
Vorsitzender des Aufsichtsrats: Dr. Georg F. Thoma
Sitz: Limbecker Platz 1, 45127 Essen, Amtsgericht Essen, HRB 13522
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel

