[PATCH] common/macros.c:2185:grab_standard_servicegroup_macro() speed up & Service check execution problem report

Stephane LAPIE stephane.lapie at darkbsd.org
Tue Jan 4 04:37:29 CET 2011


Hello list,

I apologize in advance should this topic have already been raised in the
past.



We make fairly intensive use of Nagios at our company (around 1700
machines, for 26000 services), using a cluster of OpenBSD machines.

We distribute checks using NSCA (with a Ruby reimplementation of the
server) and external handler programs that offload sending the packets,
which leaves Nagios with the sole task of writing results to a named pipe.

While tuning my configuration and creating several service groups
(simply for display purposes), I stumbled upon several problems:

1) An actual bug: Beyond a certain number of service group members,
Nagios fails to handle service checks for the affected services within
its child processes, and then reports the failure with a very misleading
error message: "Warning : Return code 127 was out of bounds. Make sure
the plugin you're trying to run actually exists" (even though the EXACT
same configuration, minus service groups, works perfectly fine).

I haven't pinpointed the root cause of this one, and I think I have
simply found a triggering case, but it seems to hint at a deeper
problem in the check handling. (Additionally, the message associated
with code 127 should be made more accurate, as I spent several days
trying to figure out whether some combination of odd PATH environment
variables and the like could be preventing the execution of my scripts.)

As a temporary fix for my setup, I removed the related servicegroup
entries, and I am running fine for now, but I am hoping this will be
fixed in a future version, as this is really more than just a small
annoyance. :(


2) A performance problem: The MACRO_SERVICEGROUPMEMBERS code is
painfully slow and extremely costly in CPU time. The attached
patch file is my attempt at fixing the most obvious issues:
 - Repetitive malloc/realloc (I initially caught on to this by ktrace-ing
the processes and realizing Nagios was mapping/unmapping a lot of memory).
 - Repetitive string duplications and length calculations

The patched code has been tested for a few hours on a busy Nagios setup
and performs much faster, as expected. (It reduces several thousand
malloc/realloc calls to one by initially calculating the memory size to
be allocated, thus avoiding unneeded system calls and duplication of
memory areas.)
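
To illustrate the idea (this is a minimal sketch, not the attached patch
itself): measure the final string length in a first pass, allocate once,
then copy each piece into place. The struct and function names below are
hypothetical stand-ins for the Nagios member list, and the
"host,service,host,service" layout is assumed for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a servicegroup member list. */
struct member {
    const char *host_name;
    const char *service_description;
    struct member *next;
};

/* Build "host1,svc1,host2,svc2,..." with exactly one malloc().
 * Returns a heap string the caller must free(), or NULL on failure. */
char *join_members(const struct member *head)
{
    const struct member *m;
    size_t total = 1;            /* room for the terminating NUL */
    char *buf, *p;

    /* Pass 1: add up the space needed. Two separators are counted per
     * member, so the buffer is at most one byte larger than required. */
    for (m = head; m != NULL; m = m->next)
        total += strlen(m->host_name) + strlen(m->service_description) + 2;

    buf = malloc(total);
    if (buf == NULL)
        return NULL;

    /* Pass 2: copy each piece at the write cursor; the growing buffer
     * is never rescanned and never reallocated. */
    p = buf;
    for (m = head; m != NULL; m = m->next) {
        size_t hlen = strlen(m->host_name);
        size_t slen = strlen(m->service_description);

        if (p != buf)
            *p++ = ',';
        memcpy(p, m->host_name, hlen);
        p += hlen;
        *p++ = ',';
        memcpy(p, m->service_description, slen);
        p += slen;
    }
    assert((size_t)(p - buf) < total);
    *p = '\0';
    return buf;
}
```

For a list holding (web1, HTTP) and (web2, SSH), join_members() returns
"web1,HTTP,web2,SSH" after a single allocation.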


3) Which brings me to a feature request: Nagios does not cache the
generated output of standard macros such as service group members
(derived from configuration, and therefore static within any given
Nagios process), and has to go through the process of regenerating the
list every single time a child process is executed and environment
macros are set. This is extremely time-consuming, and caching the
result would allow further performance improvements.

I might try implementing it on my own, but I would appreciate a few
pointers as to whether there is a framework or previous work within
Nagios that would facilitate the job.
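
As a rough sketch of what I have in mind (all names here are
hypothetical, and nothing below is existing Nagios code): keep a pointer
to the generated string next to the group it is derived from, build it
on first use, and drop it whenever the configuration is reloaded.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical group record; not a real Nagios structure. */
struct group {
    const char **names;      /* member names, fixed after config load */
    size_t count;
    char *cached_members;    /* NULL until the macro is first requested */
    unsigned builds;         /* number of times the string was generated */
};

/* Join the member names with ',' using a single allocation. */
static char *build_members(const struct group *g)
{
    size_t i, total = 1;     /* room for the terminating NUL */
    char *buf, *p;

    for (i = 0; i < g->count; i++)
        total += strlen(g->names[i]) + 1;

    buf = malloc(total);
    if (buf == NULL)
        return NULL;

    p = buf;
    for (i = 0; i < g->count; i++) {
        size_t len = strlen(g->names[i]);
        if (i > 0)
            *p++ = ',';
        memcpy(p, g->names[i], len);
        p += len;
    }
    assert((size_t)(p - buf) < total);
    *p = '\0';
    return buf;
}

/* Return the cached string, generating it only on the first call. */
const char *group_members(struct group *g)
{
    if (g->cached_members == NULL) {
        g->cached_members = build_members(g);
        if (g->cached_members != NULL)
            g->builds++;
    }
    return g->cached_members;
}

/* Drop the cache, e.g. when the configuration is re-read. */
void group_invalidate(struct group *g)
{
    free(g->cached_members);
    g->cached_members = NULL;
}
```

Note that a cache filled only inside a forked child is lost when that
child exits, so the string would have to be generated in the parent
process for later children to inherit it.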



More on the aforementioned bug:

I have found a value at which (and probably beyond which) the bug can
be reproduced (but it does not seem to be the direct cause). The
"symptoms" can be tracked down to MACRO_SERVICEGROUPMEMBERS generating a
338084-byte string (35 services, assigned to 294 machines via templates).

During my initial observations, before writing the patch, I noticed
that the affected service checks would not even be executed (this was
confirmed by directly hacking /bin/sh (used by popen()) to log every
script executed: all of them showed up except for these).

Therefore, I initially thought the massive slowdown caused by the
inefficient malloc/realloc loop would cause the child to time out and
the plug-in script to be reported as erroneously "non-existent" (because
the parent process would never receive any information, and the child
would be killed before it even had a chance to run popen()), but even
with my patch, the problem would not disappear.

After applying the patch and confirming the performance increase, I
gave debugging a quick try, to no avail so far: the child process
responsible for the check seems to be able to complete, but is still
seen by the parent process as "failing". This suggests the check
handling code is acting up, though why it would report this as
"error 127" is beyond me at this point.
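
For reference, here is a small standalone sketch (not Nagios code; the
function name is hypothetical) of how an exit code of 127 normally
surfaces through popen()/pclose(): the shell exits with 127 when it
cannot find the command it was asked to run.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/wait.h>

/* Run cmd through popen() and return its exit code,
 * or -1 if it could not be started or did not exit normally. */
int plugin_exit_code(const char *cmd)
{
    FILE *fp;
    char line[256];
    int status;

    assert(cmd != NULL);

    fp = popen(cmd, "r");
    if (fp == NULL)
        return -1;

    /* Drain the command's output so it can run to completion. */
    while (fgets(line, sizeof line, fp) != NULL)
        ;

    status = pclose(fp);
    if (status == -1 || !WIFEXITED(status))
        return -1;

    return WEXITSTATUS(status);
}
```

Calling plugin_exit_code() with a path that does not exist yields 127.
POSIX also specifies that pclose() reports a status as if the shell had
exited with 127 when the shell itself cannot be executed, so the same
code can appear even though the plugin exists. Whether that is what
happens here is not established.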


Thanks for your time, and I hope the patch can help a few people.
-- 
Stephane LAPIE, EPITA SRS, Promo 2005
"Even when they have digital readouts, I can't understand them."
--MegaTokyo
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch-common-macros.c
Type: text/x-csrc
Size: 2713 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20110104/1b12cde7/attachment.c>