[PATCH] common/macros.c:2185:grab_standard_servicegroup_macro() speed up & Service check execution problem report

Andreas Ericsson ae at op5.se
Tue Jan 4 10:38:07 CET 2011


On 01/04/2011 04:37 AM, Stephane LAPIE wrote:
> Hello list,
> 
> I apologize in advance should this topic have already been raised in the
> past.
> 
> 
> 
> We make fairly intensive use of Nagios at our company (around 1700
> machines, for 26000 services), using a cluster of OpenBSD machines.
> 
> We do distribution using NSCA (a re-made Ruby implementation of the
> server), and external handler programs to offload sending the packets
> (which leaves to Nagios the sole task of writing results to a named pipe).
> 

http://www.op5.org/community/plugin-inventory/op5-projects/merlin
http://git.op5.org/git/?p=nagios/merlin.git;a=blob;f=HOWTO;hb=master
http://git.op5.org/git/?p=nagios/merlin.git;a=blob;f=README;hb=HEAD

Make especially sure you read the first paragraph of the README.

> While tuning my configuration and creating several service groups
> (simply for display purposes), I stumbled upon several problems :
> 
> 1) An actual bug : Beyond a certain number of members, Nagios simply
> fumbles at handling service checks for affected services within its
> child processes, and then reports the failure with a very misleading
> error message : "Warning : Return code 127 was out of bounds. Make sure
> the plugin you're trying to run actually exists". (when the EXACT same
> configuration, minus service groups, works perfectly fine)
> 
> I haven't pinpointed the final cause for this one, and I think I have
> simply found a triggering case, but this seems to hint at a deeper
> problem in the check handling. (Additionally, the message associated
> with code 127 should be made more accurate, as I spent several days
> figuring if any combination of funny PATH environment variables and such
> could prevent the execution of my scripts)
> 
> As a temporary fix for my setup, I removed the related servicegroups
> entries, and I am running fine for now, but I am hoping this will be
> fixed in a future version, as this is really more than just a small
> annoyance. :(
> 

Disable environment macros instead. If you're not using that macro on
the command line, your checks will continue to work. It's not a bug in
Nagios as such; it's just that environment variables and command-line
arguments share memory space, and that space is limited. With your 300k+
list of servicegroup members, you exhaust that space very quickly, and
check execution fails.
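
To illustrate the shared limit: the sketch below estimates how many bytes
an argv and an environment would occupy at exec time and compares the sum
against sysconf(_SC_ARG_MAX). The helper names vector_bytes() and
exec_args_fit() are made up for this example and are not Nagios
functions, and the accounting is an approximation, since kernels count
this space slightly differently.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Approximate bytes a NULL-terminated string vector (argv or envp)
 * occupies when handed to execve(): every string plus its NUL, one
 * pointer per entry, and one pointer for the terminating NULL.
 * Hypothetical helper, not part of Nagios. */
static size_t vector_bytes(char *const vec[])
{
    size_t total = sizeof(char *); /* terminating NULL pointer */
    for (size_t i = 0; vec[i] != NULL; i++)
        total += strlen(vec[i]) + 1 + sizeof(char *);
    return total;
}

/* Returns 1 if argv and envp together fit under the reported limit,
 * 0 if they do not, and -1 if the limit cannot be determined. */
static int exec_args_fit(char *const argv[], char *const envp[])
{
    long limit = sysconf(_SC_ARG_MAX);
    if (limit <= 0)
        return -1;
    return vector_bytes(argv) + vector_bytes(envp) <= (size_t)limit;
}
```

When the total is over the limit, execve() fails with E2BIG. If the
command is started through a shell that cannot be exec'ed for that
reason, an exit status of 127 is one plausible outcome, which would match
the misleading message quoted above. In nagios.cfg the relevant switch
should be enable_environment_macros=0.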

> 
> 2) A performance problem : The MACRO_SERVICEGROUPMEMBERS code is
> painfully slow and extremely costly in CPU performance. The attached
> patch file is my attempt at fixing the most obvious issues :
>   - Repetitive malloc/realloc (I initially caught on this by ktrace-ing
> the processes and realizing Nagios was mapping/unmapping a lot of memory).
>   - Repetitive string duplications and length calculations
> 
> The above code has been tested for a few hours on a busy Nagios setup
> and performs much faster, as expected. (Reduction of several thousands
> of malloc/realloc calls to 1, by initially calculating the memory size to
> be allocated, thus avoiding unneeded system calls and memory areas
> duplication)
> 

Nice patch. I'll apply it tomorrow when it's my Nagios day. Any chance
you could whip up something similar for HOSTGROUPMEMBERS until then?
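
For readers without the attachment, the general technique described in
the quoted text (measure first, allocate once, then copy) can be sketched
as follows. This is not the patch itself: struct member and
join_members() are hypothetical stand-ins, and the "host,service" pair
layout of the output is an assumption about the macro's format.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for one servicegroup member. */
struct member {
    const char *host_name;
    const char *service_description;
};

/* Builds "host1,svc1,host2,svc2,..." with a single allocation.
 * Returns a malloc()ed string the caller must free(), or NULL if the
 * allocation fails. */
static char *join_members(const struct member *m, size_t n)
{
    /* Pass 1: compute the exact size, starting with the trailing NUL. */
    size_t len = 1;
    for (size_t i = 0; i < n; i++) {
        len += strlen(m[i].host_name) + 1 + strlen(m[i].service_description);
        if (i + 1 < n)
            len += 1; /* comma between entries */
    }

    char *buf = malloc(len);
    if (buf == NULL)
        return NULL;

    /* Pass 2: copy each piece to a moving write pointer, so nothing
     * is ever reallocated or rescanned from the start. */
    char *p = buf;
    for (size_t i = 0; i < n; i++) {
        size_t hl = strlen(m[i].host_name);
        size_t sl = strlen(m[i].service_description);
        memcpy(p, m[i].host_name, hl);
        p += hl;
        *p++ = ',';
        memcpy(p, m[i].service_description, sl);
        p += sl;
        if (i + 1 < n)
            *p++ = ',';
    }
    *p = '\0';
    return buf;
}
```

Compared with growing the buffer by realloc() per member, this walks the
list twice but calls the allocator once, and it avoids re-measuring the
already-built prefix on every append.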

> 
> 3) Which brings me to a feature request : Nagios does not cache the
> generation output of standard macros such as service group members
> (derived from configuration, and therefore static within any given
> Nagios process), and has to go through the process of regenerating the
> list every single time a child process is executed and environment
> macros are set. This is extremely time-consuming, and further
> performance improvements could be achieved through this.
> 

Such a performance increase would come at a fairly costly price though,
since Nagios fork()s each time it runs a check and the memory would
be duplicated to each child. Most of it should be shared on Linux, but
on Solaris, BSD and others it might prevent Nagios from running
altogether. It would also be a complete and utter waste to stash them
if environment variables are turned off and they're never used in a
real command, as is expected for large installations.
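
One way to avoid the "stash them and never use them" waste would be to
build a cached value only the first time something asks for it. The
sketch below is a minimal illustration under that assumption; struct
macro_cache, macro_cache_get() and demo_builder() are invented names, not
existing Nagios code, and it does nothing about the per-child memory
duplication concern.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-group cache slot; not a Nagios structure. */
struct macro_cache {
    char *value; /* NULL until the macro is first requested */
};

/* Callback that produces a freshly malloc()ed macro value. */
typedef char *(*macro_builder)(const void *ctx);

/* Returns the cached value, building it on first use. Returns NULL
 * if the builder fails; a later call will simply try again. */
static const char *macro_cache_get(struct macro_cache *c,
                                   macro_builder build, const void *ctx)
{
    if (c->value == NULL)
        c->value = build(ctx);
    return c->value;
}

/* Drops the cached value, e.g. after a configuration reload. */
static void macro_cache_invalidate(struct macro_cache *c)
{
    free(c->value);
    c->value = NULL;
}

/* Demo builder for illustration only: copies the context string and
 * counts how often it was invoked. */
static int build_calls;

static char *demo_builder(const void *ctx)
{
    const char *s = ctx;
    char *copy = malloc(strlen(s) + 1);
    build_calls++;
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}
```

With this shape, a setup that never expands the macro pays only for one
NULL pointer per group, while a setup that does expand it pays the
construction cost once per configuration load.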

> I might try implementing it on my own, but I would appreciate a few
> pointers as to whether there is a framework or previous work within
> Nagios that would facilitate the job.
> 

Not beyond the hashing code, no, and that isn't very good as-is.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
