[PATCH] common/macros.c:2185:grab_standard_servicegroup_macro() speed up & Service check execution problem report

Andreas Ericsson ae at op5.se
Tue Jan 4 10:23:52 CET 2011


On 01/04/2011 08:43 AM, Thomas Guyot-Sionnest wrote:
> On 11-01-03 10:37 PM, Stephane LAPIE wrote:
>> Hello list,
>>
>> I apologize in advance should this topic have already been raised in the
>> past.
>>
>>
>>
>> We make fairly intensive use of Nagios at our company (around 1700
>> machines, for 26000 services), using a cluster of OpenBSD machines.
>>
>> We do distribution using NSCA (a re-made Ruby implementation of the
>> server), and external handler programs to offload sending the packets
>> (which leaves to Nagios the sole task of writing results to a named pipe).
>>
>> While tuning my configuration and creating several service groups
>> (simply for display purposes), I stumbled upon several problems :
>>
>> 1) An actual bug : Beyond a certain number of members, Nagios simply
>> fumbles at handling service checks for affected services within its
>> child processes, and then reports the failure with a very misleading
>> error message : "Warning : Return code 127 was out of bounds. Make sure
>> the plugin you're trying to run actually exists". (when the EXACT same
>> configuration, minus service groups, works perfectly fine)
>>
>> I haven't pinpointed the final cause for this one, and I think I have
>> simply found a triggering case, but this seems to hint at a deeper
>> problem in the check handling. (Additionally, the message associated
>> with code 127 should be made more accurate, as I spent several days
>> figuring if any combination of funny PATH environment variables and such
>> could prevent the execution of my scripts)
>>
>> As a temporary fix for my setup, I removed the related servicegroups
>> entries, and I am running fine for now, but I am hoping this will be
>> fixed in a future version, as this is really more than just a small
>> annoyance. :(
> [...]
>> Further about the aforementioned bug :
>>
>> I somehow have a value at which (and probably beyond which) the bug can
>> be reproduced (but it does not seem to be the direct cause). The
>> "symptoms" can be tracked down to MACRO_SERVICEGROUPMEMBERS generating a
>> 338084 bytes string (35 services, assigned to 294 machines via templates).
> 
> I believe this bug might have to do with the actual command line length
> passed to popen. Is it possible somehow this macro ends up on the
> command line?
> 

Only if there are overflow bugs. There aren't.

Stephane has enabled environment macros and the list of servicegroup
members combined with the command line arguments exceeds the limit
(which is usually 128KiB by default and may require a kernel re-compile
to increase). Environment variables and command-line parameters share
the same block of memory to stash their data, so larger environment
variables mean less space for the command line. In this case, environment
variables take up all available space, so the command line to be executed
ends up being the empty string, which properly generates the "command not
found" error.
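
To make the arithmetic concrete, here is a rough sketch of the size check,
assuming the 128KiB figure above and counting only the argument and
KEY=VALUE strings with their terminating NUL bytes. Real kernels also count
pointers and may cap individual strings, so treat this as an estimate. The
plugin path, host name and function names are made up for illustration.

```python
import os

LIMIT = 128 * 1024  # the "usually 128KiB" figure mentioned above (assumption)


def exec_block_size(argv, env):
    """Estimate the bytes needed for argv plus environment strings.

    Each argument and each KEY=VALUE entry is counted with its
    terminating NUL byte. Pointer arrays are deliberately ignored.
    """
    arg_bytes = sum(len(os.fsencode(a)) + 1 for a in argv)
    env_bytes = sum(
        len(os.fsencode(k)) + 1 + len(os.fsencode(v)) + 1  # "K" "=" "V" NUL
        for k, v in env.items()
    )
    return arg_bytes + env_bytes


def fits(argv, env, limit=LIMIT):
    """Return True if the estimated block fits within the limit."""
    return exec_block_size(argv, env) <= limit


# Hypothetical check command; the path and address are placeholders.
argv = ["/usr/local/libexec/nagios/check_ping", "-H", "192.0.2.10"]

small_env = {"NAGIOS_HOSTNAME": "web01"}
# Same environment plus a servicegroup member list of the reported size.
huge_env = dict(small_env, NAGIOS_SERVICEGROUPMEMBERS="x" * 338084)

if __name__ == "__main__":
    print("small env fits:", fits(argv, small_env))
    print("huge env fits: ", fits(argv, huge_env))
```

On a given machine, os.sysconf("SC_ARG_MAX") reports the actual limit and
can be passed as the limit argument instead of the assumed 128KiB.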

Stephane: If you disable environment macros (which you should anyway,
since they soak up tremendous amounts of cpu-time for each check), the
issue should go away until you try to use the SERVICEGROUPMEMBERS macro
on the command line for servicegroups that are spectacularly huge.
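
For reference, a minimal sketch of that change, assuming a Nagios 3.x main
configuration file (nagios.cfg); the rest of the file stays as it is:

```
# nagios.cfg (main configuration file)
# Stop exporting macros as NAGIOS_* environment variables to checks,
# notifications and event handlers.
enable_environment_macros=0
```

Any plugin that currently reads NAGIOS_* variables from its environment
would then need those values passed as command-line arguments in its
command definition instead (for example $HOSTNAME$).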

The problem with environment macros is that Nagios has no way of finding
out which of them the script will need, so it has to calculate all of them.
Using them in any kind of large-ish setup (or even a small one that might
grow) should be considered a configuration bug.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.




