Nagios 3 Performance Monitoring

Hendrik Bäcker andurin at process-zero.de
Fri Oct 26 16:44:58 CEST 2007


Hi List,

### Now the complete Mail ###

For the past few days I have been looking into some performance issues
with Nagios 3 (current CVS version).

For nicer graphing I've written a quick and dirty Perl script that parses
some relevant data from the output of the nagiostats binary.

The plugin produces output in two ways:

1. STDOUT: OK - output | perfdata
2. (optional) Output + performance data written directly to the external
command pipe of Nagios.
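
The two output paths above can be sketched as follows. This is a minimal illustration in Python, not the Perl script itself: the function names, metric labels, host name, and service description are hypothetical. It assumes the standard plugin perfdata form (label=value[unit]) and the PROCESS_SERVICE_CHECK_RESULT external command for submitting a passive result.

```python
import time


def format_perfdata(metrics):
    """Render {label: (value, unit)} as 'label=valueunit label=valueunit ...'.

    Labels here contain no spaces, so no quoting is needed.
    """
    return " ".join(
        f"{label}={value}{unit}" for label, (value, unit) in metrics.items()
    )


def plugin_line(status, message, metrics):
    """Build the line a plugin prints on STDOUT: 'STATUS - text | perfdata'."""
    return f"{status} - {message} | {format_perfdata(metrics)}"


def passive_result_command(host, service, return_code, plugin_output, now=None):
    """Build one external command line that submits a passive service result.

    host and service are hypothetical names; they must match objects that
    exist in the Nagios configuration.
    """
    timestamp = int(time.time() if now is None else now)
    return (
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{plugin_output}\n"
    )
```

To use the second path, the command line would be written to the file configured as command_file in nagios.cfg, one command per line.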

I am running a relatively large installation with up to 5 instances (for
load balancing) on one hardware server (yes, that works).

Some background data:

Instance 1: 371 / 2156 (Hosts/Services)
Instance 2: 206 / 1405 (Hosts/Services)
Instance 3: 381 / 3147 (Hosts/Services)
Instance 4:   3 /   54 (Hosts/Services)
Instance 5: 299 / 3233 (Hosts/Services)

I have enabled the "use_large_installation_tweaks" feature for all
instances and was really happy to see that I had _no_ latency at all.

But after 7-9 hours of running time I see that the host/service check
throughput goes down, the host/service check execution time goes up (x2.5)
and latency rises too.

Once the latency starts to rise, the graph shows no sign of levelling off.
It went up to 700 seconds for my fifth instance, and I guess it would have
kept increasing if I hadn't restarted the Nagios process.

############################################################
If you are interested, you can see the graphs at:

http://www.process-zero.de/nagios3/nagiosperformance-20071026-1607.pdf

The "plugin" I've written for this is at:

http://www.my-plugin.de/wiki/doku.php/projects:check_nagios_performance

(It's not polished enough to be a 'real' plugin, so there is no reason to
post it on nagiosexchange.org yet).
############################################################

Back to the problem.

I guess the 'performance trouble' is a 'during runtime' problem. So I am
looking for tasks in the code whose cost blows up over time; my current
guess is update_check_stats() in base/utils.c, which is executed on every
service check and more than once for every host check, I think.

My idea is that after a while the data structure for the stats reaches a
size that takes too much time to update, and therefore the execution time
increases.
Higher execution time leads to fewer host/service checks per unit of time,
which leads to more latency, but this is just a guess.
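
The suspected chain (higher execution time, lower throughput, growing latency) can be illustrated with a toy backlog model. This is only a sketch under stated assumptions and not how the Nagios scheduler actually works: the function name, the fixed number of parallel check slots, and the check interval are hypothetical. Only the service count (3233, instance 5) and the x2.5 slowdown come from the numbers above.

```python
def latency_over_time(num_checks, check_interval_s, exec_time_s, workers,
                      duration_s, step_s=60):
    """Estimate scheduling latency (seconds) at the end of each time step.

    Toy fluid model (an assumption, not Nagios internals):
      demand   = checks that become due per second
      capacity = checks that can finish per second with `workers` slots
    Any excess demand piles up as backlog, and the latency is the time
    needed to work off that backlog at the current capacity.
    """
    demand = num_checks / check_interval_s
    capacity = workers / exec_time_s
    backlog = 0.0
    latencies = []
    for _ in range(0, duration_s, step_s):
        backlog = max(0.0, backlog + (demand - capacity) * step_s)
        latencies.append(backlog / capacity)
    return latencies


# Assumed values: 300 s check interval, 20 parallel slots, 1.0 s base exec time.
healthy = latency_over_time(3233, 300, 1.0, 20, 3600)
# Same load after execution time has grown by the observed factor of 2.5.
degraded = latency_over_time(3233, 300, 2.5, 20, 3600)
```

In this model latency stays at zero as long as capacity exceeds demand, and grows without bound once the slowdown pushes capacity below demand, which would match the shape of a graph that never levels off. It does not say anything about why the execution time increases in the first place.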

I would like to know what other people think about this, and it would be
nice if other people out there could produce some graphs of Nagios 3
performance.

Kind regards,

Hendrik

PS. Sorry, sorry, sorry for my fast fingers on my last attempt at sending
to this list ;)
