Hi all,

I'd like to propose an overhaul of the Performance Info (extinfo.cgi?&type=4).

Over the last few weeks I prepared a migration and update from our old 2.9 install to a new physical machine and Nagios 3.0. During that time I've been watching the Performance Info a lot, since performance was an issue for us: the "migration machine" was running inside a VM on an ESX host. Sadly, I came to the conclusion that the info, as it is presented, is useless.

The reason is simple. For example, I get the number and percentage of actively checked services in the last 1/5/15/60 minutes. So far so good. But what exactly does this info tell us? Right - nothing. I have no means to interpret it, because I cannot determine whether the number of actively checked services in the last minute (for example) is good or bad.

What's missing is a number to compare against: how many services _should_ have been actively checked in the last minute. In our scenario, I have loads of services scheduled every minute (pings, disk, memory, etc.), but I also have a lot of services that are only checked once per hour or once per day. So when Nagios tells me that 68% of my service checks were performed in the last minute, I have no clue whether that means everything is alright or not.

What I would like to see is a comparable performance info, telling me:

  x% of the active service checks that should have run in the last minute have run.
  x% of the active service checks that should have run in the last 15 minutes have run.
  etc.

Then I can decide whether I am putting too much stress on the Nagios server or not. And if so, whether the cause is, for example, too many concurrent service checks that are lagging behind.

I do know that latency and execution time are displayed too, but that information is not really useful to me either.
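To make the ratio I am asking for concrete, here is a rough Python sketch. It is not how the CGI or the Nagios core computes anything; the Service record and its field names are made up for illustration. I assume a service counts as "due" in a window if it was checked inside the window, or if its last check plus its check interval has already passed without a new check.

```python
from dataclasses import dataclass


@dataclass
class Service:
    # Hypothetical record; field names are illustrative, not Nagios internals.
    name: str
    check_interval: float  # seconds between scheduled active checks
    last_check: float      # unix timestamp of the last completed active check


def coverage(services, now, window):
    """Return (checked, due, percent) for the last `window` seconds.

    checked: services with at least one active check inside the window
    due:     checked services plus those that are overdue, i.e. their
             last check is older than the window and
             last_check + check_interval has already passed
    """
    start = now - window
    checked = sum(1 for s in services if s.last_check >= start)
    overdue = sum(
        1
        for s in services
        if s.last_check < start and s.last_check + s.check_interval <= now
    )
    due = checked + overdue
    # Nothing was due, so nothing was missed: report 100%.
    percent = 100.0 * checked / due if due else 100.0
    return checked, due, percent


services = [
    Service("ping", 60, 990),        # checked inside the window
    Service("disk", 60, 900),        # overdue: should have run at 960
    Service("backup", 3600, 100),    # not due before 3700
    Service("cert", 86400, 500),     # not due for a long time
]

print(coverage(services, now=1000, window=60))  # (1, 2, 50.0)
```

In this made-up example the current display would report 1 of 4 services (25%) checked in the last minute, which says nothing. The ratio against what was due is 1 of 2 (50%), which immediately shows that a check is lagging.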
Which brings me to the next point: Check Execution Time needs some means to distinguish between checks that timed out and those that just took long. For as long as I can remember, the values displayed there have looked like this:

  Check Execution Time:   0.01 sec   10.01 sec   0.494 sec

  0.01 is checks on localhost - they are the minimum.
  10.01 is checks that timed out, mainly remote sites where the VPN is currently down, for example - they are the maximum.
  0.5 is roughly the average at all times.

I think people wouldn't even notice if you hardcoded those numbers in the CGI ;)

Info that is more or less static is not useful as a performance counter. To reflect the real circumstances, timed-out checks need to be filtered out, so that I have a means to see whether some checks take longer than expected.

/discuss

S

-- 
Sascha Runschke
Network and Systems Management
GFKL Financial Services AG
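P.S. A rough Python sketch of the filtering I have in mind for the execution time, in case it helps the discussion. It is only an illustration: the function name and the sample data are invented, and I assume a check counts as timed out when its execution time reaches the configured timeout (10 seconds here, guessed from the 10.01 sec maximum above). A real implementation would more likely use whatever flag the core sets on a timed-out check.

```python
from statistics import mean


def execution_stats(times, timeout):
    """Return (minimum, maximum, average, timed_out_count).

    Min/max/average cover only checks that finished before `timeout`;
    checks at or above the timeout are counted separately instead of
    being allowed to pin the maximum.
    """
    completed = [t for t in times if t < timeout]
    timed_out = len(times) - len(completed)
    if not completed:
        # Every check timed out (or there were none): no meaningful stats.
        return None, None, None, timed_out
    return min(completed), max(completed), mean(completed), timed_out


# Invented sample: four normal checks and two that hit the timeout.
times = [0.01, 0.25, 0.5, 1.25, 10.01, 10.01]

lo, hi, avg, timed_out = execution_stats(times, timeout=10.0)
print(f"min={lo} sec  max={hi} sec  avg={avg:.3f} sec  timed out={timed_out}")
```

With the timeouts split out, the maximum becomes the slowest check that actually completed (1.25 sec in this sample) instead of always sitting at the timeout value, and the number of timed-out checks becomes a counter of its own.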