[PATCH] common/macros.c:2185:grab_standard_servicegroup_macro() speed up & Service check execution problem report

Stephane LAPIE stephane.lapie at darkbsd.org
Tue Jan 4 08:59:26 CET 2011


On 01/04/2011 04:43 PM, Thomas Guyot-Sionnest wrote:
> On 11-01-03 10:37 PM, Stephane LAPIE wrote:
>> Hello list,
> 
>> I apologize in advance should this topic have already been raised in the
>> past.
> 
>> We make fairly intensive use of Nagios at our company (around 1700
>> machines, for 26000 services), using a cluster of OpenBSD machines.
> 
>> We handle distribution using NSCA (with a Ruby reimplementation of the
>> server) and external handler programs that offload sending the packets,
>> which leaves Nagios with the sole task of writing results to a named pipe.
> 
>> While tuning my configuration and creating several service groups
>> (purely for display purposes), I ran into several problems:
> 
>> 1) An actual bug: beyond a certain number of service group members,
>> Nagios fails to handle service checks for the affected services within
>> its child processes, and then reports the failure with a very misleading
>> error message: "Warning : Return code 127 was out of bounds. Make sure
>> the plugin you're trying to run actually exists". The EXACT same
>> configuration, minus the service groups, works perfectly fine.
> 
>> I haven't pinpointed the root cause of this one, and I think I have
>> simply found a triggering case, but it seems to hint at a deeper
>> problem in the check handling. (Additionally, the message associated
>> with code 127 should be made more accurate, as I spent several days
>> trying to figure out whether some odd combination of PATH environment
>> variables and the like was preventing my scripts from executing.)
> 
>> As a temporary fix for my setup, I removed the related servicegroup
>> entries, and I am running fine for now, but I hope this will be
>> fixed in a future version, as it is really more than just a small
>> annoyance. :(
> [...]
>> More on the aforementioned bug:
> 
>> I have found a size at which (and probably beyond which) the bug can
>> be reproduced, although it does not seem to be the direct cause. The
>> "symptoms" can be traced to MACRO_SERVICEGROUPMEMBERS generating a
>> 338084-byte string (35 services, assigned to 294 machines via templates).
> 
> I believe this bug might have to do with the actual command line length
> passed to popen. Is it possible that this macro somehow ends up on the
> command line?

In my setup, this specific macro is never used by the command objects
in question (the ones Nagios fails to execute).
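
A related possibility that may be worth ruling out: the limit on exec
arguments usually covers the environment as well as the command line, and
popen() reports 127 when the shell itself cannot be executed. If macros
are exported to the check's environment (an assumption here, as is the
variable name NAGIOS_SERVICEGROUPMEMBERS), a 338084-byte value could hit
that limit without ever appearing in a command definition. The sketch
below is only an illustration in Python, not Nagios code; the helper
names are hypothetical, and the size it computes is a rough lower bound
(argv and pointer overhead are ignored).

```python
import os

def env_block_size(env):
    """Bytes needed for the environment strings alone:
    each entry is "KEY=VALUE" plus a terminating NUL byte."""
    return sum(len(os.fsencode(k)) + len(os.fsencode(v)) + 2
               for k, v in env.items())

def exceeds_exec_limit(env, limit):
    """True if the environment alone is already larger than `limit` bytes."""
    return env_block_size(env) > limit

def system_arg_max(default=262144):
    """Ask the OS for ARG_MAX; fall back to `default` (an assumed value)
    when the limit is unavailable or reported as indeterminate."""
    try:
        value = os.sysconf("SC_ARG_MAX")
    except (ValueError, OSError, AttributeError):
        return default
    return value if value > 0 else default

if __name__ == "__main__":
    env = dict(os.environ)
    # Hypothetical variable name; size taken from the report above.
    env["NAGIOS_SERVICEGROUPMEMBERS"] = "x" * 338084
    limit = system_arg_max()
    print(env_block_size(env), limit, exceeds_exec_limit(env, limit))
```

Running this on the monitoring host would show whether the environment
alone exceeds the reported ARG_MAX. Some kernels also cap the length of
each individual string, so a single oversized variable can fail even when
the total is within the limit.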
-- 
Stephane LAPIE, EPITA SRS, Promo 2005
"Even when they have digital readouts, I can't understand them."
--MegaTokyo

_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel

