Passive check reception in nagios-1.2 does not scale (has anyone already written a patch?)

Bjørnar Bjørgum Larsen bjornar.bjorgum.larsen at ementor.no
Fri Apr 23 18:01:35 CEST 2004


Hello,

there seems to be no way to get nagios-1.2 to read its external command file, and so process passive service checks, more often than once per second. In our tests, setting
command_check_interval=-1
produces far worse results than setting
command_check_interval=1s

Problem:
If the above is generally true, then with an average external command length of 120 characters and a FIFO length of 4096 characters, nagios-1.2 cannot handle more than roughly 34 passive checks a second (4096 / 120), because the FIFO is drained only once per second. We want our passive-only nagios-1.2 hosts to process at least 100 checks a second, and we don't see any other reason why a nagios host doing nothing but receiving and forwarding checks can't do that.

Possible hack: 
Patch the nagios-1.2 source to use microseconds since epoch instead of seconds since epoch when scheduling checks (if we do so for external commands, everything else has to tag along, I think). Together with a new 'm' suffix for milliseconds, we could then specify e.g. command_check_interval=100m to get nagios to check its external command file 10 times every second. This means modifying the sources to use gettimeofday() instead of time(), redefining the TIMED_EVENT struct, and dividing/multiplying accordingly in the other places where TIMED_EVENT is used. Do you think this is a sound way of handling this? If anyone has done this already (or anything else that significantly speeds up the checking of external commands), could you mail the patches to the list, please?
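
To make the proposed option concrete, here is a rough sketch of how such an interval value could be turned into microseconds. It is written in Python for brevity, not as the actual C patch. The function names are made up for illustration, the 'm' suffix is the hypothetical extension described above and does not exist in nagios-1.2, and the handling of bare numbers assumes the default interval_length of 60 seconds.

```python
import time

US_PER_SECOND = 1_000_000


def parse_interval_us(value, interval_length=60):
    """Convert a command_check_interval value to microseconds.

    Returns None for -1 ("as often as possible").
    The 'm' (milliseconds) suffix is the hypothetical extension
    discussed above; it is not part of nagios-1.2.
    """
    value = value.strip()
    if value == "-1":
        return None
    if value.endswith("s"):           # seconds, e.g. "1s"
        return int(value[:-1]) * US_PER_SECOND
    if value.endswith("m"):           # milliseconds, e.g. "100m" (hypothetical)
        return int(value[:-1]) * 1000
    # Bare number: multiples of interval_length seconds (assumed default 60).
    return int(value) * interval_length * US_PER_SECOND


def now_us():
    """Microseconds since epoch, the role gettimeofday() would play in C."""
    return time.time_ns() // 1000


def next_run_us(current_us, interval_us):
    """Time of the next command file check, in microseconds since epoch."""
    return current_us + interval_us
```

With this, parse_interval_us("100m") gives 100000 microseconds, i.e. ten command file checks per second, while "1s" gives 1000000.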

According to rumour, nagios-2.0 keeps the FIFO empty. How's this done in 2.0? Has anyone performance tested nagios-2.0 external command handling? Any chance of a backport to 1.2, Ethan?


Test description:
Locally on the Nagios server, we echo 5000 lines of real nsca data to the external command file as fast as possible, one "echo $some_check_result > nagios.cmd" per check, and measure how long it takes.
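
For anyone who wants to repeat the test, the procedure can be sketched roughly as below. This is not the script we actually ran (that was a plain shell loop). The function names, host and service names are made up for illustration, and the command file path has to be adjusted to the local installation.

```python
import time


def format_passive_result(host, service, return_code, output, when=None):
    """Build one passive service check line in external command format."""
    ts = int(time.time() if when is None else when)
    return f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{return_code};{output}"


def feed_commands(command_file, lines):
    """Write each line with its own open/write/close, like one echo per check.

    Opening a FIFO for writing blocks until a reader (nagios) has it open,
    and each write blocks while the FIFO is full, so the elapsed time
    reflects how fast nagios drains the command file.
    Returns (lines sent, elapsed seconds, lines per second).
    """
    start = time.monotonic()
    sent = 0
    for line in lines:
        with open(command_file, "w") as fifo:
            fifo.write(line + "\n")
        sent += 1
    elapsed = time.monotonic() - start
    rate = sent / elapsed if elapsed > 0 else float("inf")
    return sent, elapsed, rate
```

Calling feed_commands() with the path of nagios.cmd and a list of 5000 prepared lines returns the number of lines sent, the elapsed time and the resulting checks per second.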

If nagios is checking "as often as possible" (command_check_interval=-1), the results vary a lot; the best runs reach 15 checks per second, and the worst took so long that I gave up waiting. When checking once per second, we get results very close to the theoretical maximum of approx. 33 checks per second: with a FIFO length of 4096 characters and an average of 123 characters per check in our test data, at most 4096 / 123 = 33 checks fit into the FIFO per read.
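
The ceiling above is simple arithmetic, sketched here under the assumption that nagios drains at most one full FIFO per command file check and that the average command length includes the trailing newline. The function name is made up for illustration.

```python
FIFO_BYTES = 4096  # FIFO length assumed throughout this mail


def max_checks_per_second(avg_command_bytes, reads_per_second=1, fifo_bytes=FIFO_BYTES):
    """Upper bound on passive checks per second.

    Only whole commands fit into the FIFO, and the FIFO is assumed
    to be emptied once per command file check.
    """
    commands_per_read = fifo_bytes // avg_command_bytes
    return commands_per_read * reads_per_second


print(max_checks_per_second(123))                       # 33
print(max_checks_per_second(120))                       # 34
print(max_checks_per_second(123, reads_per_second=10))  # 330
```

The last line shows why a 100 millisecond check interval would be enough for our target of 100 checks a second, provided the processing itself keeps up.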

One problem with command_check_interval=-1 seems to be that nagios won't re-read the FIFO until it has finished processing all checks from the previous read. Note that we used (and reused) the same data, a couple of days old, for these tests. Since I'm not sure exactly how nagios-1.2 decides when to schedule the next command file read with command_check_interval=-1, I don't know whether this invalidates this part of the test results.

We tested on a Compaq DL360 (P3 1133MHz, 256MB memory) running Debian GNU/Linux with kernel 2.4.25, and nagios-1.2. If need be, we'll use better hardware in production, but both load and memory use were low during these tests, so the hardware should not matter for the test results. We had no web server running on the nagios host we were testing.


Thank you in advance for all input, and have a good week-end!

:) Bjørnar

