nagiostats Bug with Active Service Checks

tanner tanner at linuxbox.com
Mon Feb 23 23:38:03 CET 2009


Hello,

During the course of a recent distributed deployment, I discovered a bug 
in nagiostats (and possibly Nagios) that led to misleading statistics 
in certain situations.

In particular, I set things up so that every distributed server knew 
about all of the service checks, but inherited several properties 
(active_checks_enabled, notifications, etc) from a single configuration 
file that was unique on each Nagios server. After initially loading up a 
single monitoring host with a couple thousand service checks, I shuffled 
them out to the other distributed hosts. This led to nagiostats 
reporting insane numbers for the active check latency of the initially 
loaded up host but realistic numbers for the other ones.

It appears that nagiostats uses check_type to determine whether to 
process a service as though it is active, rather than 
active_checks_enabled. This may well be fine if Nagios correctly reset 
check_type after a configuration reload, but it doesn't appear to change 
it.
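
To make the difference concrete, here is a minimal Python sketch (not the 
attached patch, and not how nagiostats itself is written) that averages 
check_latency over servicestatus blocks using each of the two criteria. 
It assumes the usual key=value layout of status.dat and that check_type=0 
means "active"; the helper names and the sample data are made up for 
illustration.

```python
def parse_service_blocks(text):
    """Yield one dict of key=value pairs per 'servicestatus {' block."""
    block = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("servicestatus") and line.endswith("{"):
            block = {}
        elif line == "}" and block is not None:
            yield block
            block = None
        elif block is not None and "=" in line:
            key, _, value = line.partition("=")
            block[key] = value


def average_latency(services, is_active):
    """Mean check_latency over the services that is_active() selects."""
    latencies = [float(s["check_latency"]) for s in services if is_active(s)]
    return sum(latencies) / len(latencies) if latencies else 0.0


def by_check_type(service):
    # Criterion nagiostats appeared to use: type of the last check (0 = active).
    return service.get("check_type") == "0"


def by_enabled_flag(service):
    # Alternative criterion: whether active checks are currently enabled.
    return service.get("active_checks_enabled") == "1"


# Hypothetical sample: web2 was switched to active_checks_enabled=0 but
# still carries check_type=0 and a stale, very large latency.
SAMPLE = """\
servicestatus {
	host_name=web1
	service_description=HTTP
	check_type=0
	check_latency=0.250
	active_checks_enabled=1
	}

servicestatus {
	host_name=web2
	service_description=HTTP
	check_type=0
	check_latency=900.000
	active_checks_enabled=0
	}
"""

if __name__ == "__main__":
    services = list(parse_service_blocks(SAMPLE))
    print("by check_type:           ", average_latency(services, by_check_type))
    print("by active_checks_enabled:", average_latency(services, by_enabled_flag))
```

With this sample, filtering on check_type averages in the stale 900 second 
latency, while filtering on active_checks_enabled only counts web1.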

It looked like, as I changed services to active_checks_enabled = 0, the 
active service latency average went higher and higher. Looking in 
status.dat, the recently disabled services (which, by the by, still had 
an active check scheduled when they were switched to 
active_checks_enabled=0) would eventually time out and have a massive 
latency, which would be averaged in with the rest of the latencies.
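
The arithmetic behind that climb can be sketched in a few lines of Python. 
All numbers here are hypothetical (2000 services, 0.2 s normal latency, 
600 s for a disabled service whose scheduled check has gone stale); the 
point is only that each disabled service still counted as active pulls the 
mean further up.

```python
def mean_latency(total, stale, normal_latency, stale_latency):
    """Mean latency when `stale` of `total` services report stale_latency
    and all of them are still counted as active."""
    healthy = total - stale
    return (healthy * normal_latency + stale * stale_latency) / total


if __name__ == "__main__":
    # Hypothetical figures, chosen only to show the trend.
    for stale in (0, 10, 100, 500):
        avg = mean_latency(2000, stale, 0.2, 600.0)
        print(f"{stale:4d} disabled services -> average latency {avg:.3f} s")
```

Ten stale services out of 2000 are already enough to move the average 
from 0.2 s to roughly 3.2 s in this made-up scenario.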

This was specifically with Nagios 3.0.6; my apologies if this has been 
fixed since the latest stable release.

The attached patch may be the correct answer, or it may be a workaround 
for Nagios only setting check_type the first time a service is created 
in status.dat. Either way, it was the quickest way for me to get more 
accurate latency information, so I thought I'd share it along with the bug.

Feel free to let me know if there are any questions or if my diagnosis was 
entirely wrong.

Thanks,
Tanner



-- 
Tanner Beck
The Linux Box
734.761.4689
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: nagiostats-active-service-check-check.diff
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20090223/c785195f/attachment.ksh>
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel

