Patch RFC - Nagios 3.2 - permanently remove sleep on run_event == FALSE in main loop (events.c) or conditionally remove using nagios.cfg configuration parameter?

Max perldork at webwizarddesign.com
Fri Oct 30 16:32:32 CET 2009


Hi,

We have been working on reducing the scheduling skew for Nagios
service checks through a number of different techniques; yesterday we
were looking through the main event loop in events.c and saw that when
an event is encountered that is *NOT* scheduled to run, Nagios sleeps
for the sleep_time interval configured in nagios.cfg, with a comment
about not hogging CPU.

While this can certainly be useful for environments with less
powerful hardware, or where performance data intervals are not as
critical as 'playing nice', it adds a lot of scheduling skew to
Nagios for environments (like ours) that are required to get
performance data into other systems at very regular intervals.  If
nanosleep is used, it also drives the load up on the system over
time (on RHEL 5.1, 5.2, and 5.4 at least).

We commented out that code in our environment yesterday and noticed that:
* Latency now grows much more slowly over time
* System load decreased noticeably, as nanosleep is no longer being
called thousands of times in a polling cycle (test env has 9000 active
services on ~ 1400 hosts with ~ 800 not runnable due to service
dependency rules)

To give real numbers, our pre-patch latency was going from 0 to 12
seconds within about 10 hours; post-patch latency has only increased
to about 1 second after 14 hours of running on this build.  We
consider latency too high when our SNMP counter-based check
intervals stretch to 10% more than the configured interval (e.g.
330 seconds if the interval is 300 seconds), as that then causes
gaps in the time series data warehouse we send our performance data
to.
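
To make the 10% rule concrete, here is a minimal sketch of the
threshold test.  The function name and parameters are hypothetical
(nothing like this exists in Nagios itself); it only restates the
arithmetic above.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not part of Nagios: returns true when the
 * observed check interval has stretched to at least tolerance_pct
 * percent beyond the configured interval (both in seconds).
 * Integer arithmetic avoids floating-point rounding at the boundary.
 */
bool interval_too_long(long configured_s, long observed_s, long tolerance_pct)
{
    return observed_s * 100 >= configured_s * (100 + tolerance_pct);
}
```

With a configured interval of 300 seconds and a tolerance of 10, an
observed interval of 330 seconds trips the check while 329 does not.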

Pre-patch, system load was climbing to 7 after 12-14 hours; post-patch,
it has levelled off at around 3-4 after 14 hours.  This is on a dual
quad-core Intel system with 8 GB RAM.  Service check throughput is
around 2k checks per minute.

So while this was a trivial thing to change, for a larger environment
it makes a very noticeable difference in performance and we would like
to contribute it as a performance patch.

So rather than just removing the code, I am thinking that our patch
could perform that additional sleep only if
use_large_installation_tweaks in nagios.cfg is set to 0.
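
A rough sketch of what that conditional sleep could look like.  This
is not the actual events.c code: the helper names are hypothetical,
and it assumes sleep_time is a fractional number of seconds and that
the tweak setting is available as a plain int.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <time.h>

/* Hypothetical helper: sleep only when the large-installation
 * tweaks are disabled (set to 0 in nagios.cfg). */
bool should_sleep_when_not_runnable(int use_large_installation_tweaks)
{
    return use_large_installation_tweaks == 0;
}

/* Hypothetical helper: split fractional seconds into the
 * seconds/nanoseconds pair that nanosleep() expects. */
struct timespec sleep_time_to_timespec(double sleep_time)
{
    struct timespec delay;
    delay.tv_sec = (time_t)sleep_time;
    delay.tv_nsec = (long)((sleep_time - (double)delay.tv_sec) * 1000000000.0);
    return delay;
}

/* Called where the main loop finds the next event is not runnable. */
void idle_if_configured(int use_large_installation_tweaks, double sleep_time)
{
    struct timespec delay;

    if (!should_sleep_when_not_runnable(use_large_installation_tweaks))
        return; /* large installation: skip the sleep entirely */

    delay = sleep_time_to_timespec(sleep_time);
    nanosleep(&delay, NULL);
}
```

Existing installations that leave the tweaks at 0 would keep today's
behaviour; only sites that have already opted in would lose the sleep.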

Thoughts / opinions?

- Max
