Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)

Ethan Galstad nagios at nagios.org
Thu Dec 21 17:56:49 CET 2006


Good work on nailing down the problem to the command buffer slots! 
Sounds like this problem might affect a number of users, so I think we 
need to patch Nagios. There are two possible solutions:

1.  Bump up the default buffer slots to something larger.  Since Nagios 
only immediately allocates memory for pointers, the additional memory 
overhead is fairly small.  Allocated memory = (sizeof(char **)) * (# of 
slots).

2.  Move the slot definitions out to config file variables.  This is 
a better solution than having to edit the code and recompile.
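
To put rough numbers on option 1, and to show what a runtime-sized table 
for option 2 might look like, here is a minimal sketch. The names 
slot_table_bytes and alloc_slot_table are hypothetical and are not taken 
from the Nagios source. It assumes each slot is a single string pointer, 
which on common platforms is the same size as the char ** in the formula 
above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes allocated up front for a table of `slots` string pointers.
 * Only the pointers are counted; the strings themselves are allocated
 * later, as entries arrive. (Hypothetical helper, not Nagios code.) */
size_t slot_table_bytes(size_t slots)
{
    assert(slots <= SIZE_MAX / sizeof(char *)); /* guard against overflow */
    return slots * sizeof(char *);
}

/* Allocate a zeroed pointer table whose size is chosen at runtime,
 * for example from a value read out of a config file, instead of a
 * compile-time #define. Returns NULL if slots is 0 or allocation fails. */
char **alloc_slot_table(size_t slots)
{
    if (slots == 0)
        return NULL;
    return calloc(slots, sizeof(char *));
}
```

With 8-byte pointers, slot_table_bytes(60000) comes to 480000 bytes per 
buffer, compared with 8192 bytes for the default 1024 slots.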

Thoughts?


Ton Voon wrote:
> Hi Mahesh,
> 
> On 19 Dec 2006, at 00:42, Mahesh Kunjal wrote:
> 
>> Here is what we did to resolve.
>>
>> 1. Edit the include/nagios.h.in
>> change
>> #define COMMAND_BUFFER_SLOTS 1024
>> to
>> #define COMMAND_BUFFER_SLOTS 60000
>>
>> And change
>> #define SERVICE_BUFFER_SLOTS 1024
>> to
>> #define SERVICE_BUFFER_SLOTS 60000
>>
> 
> I was intrigued by this as we have a performance issue, but not with the 
> same symptoms. Our problem is that NSCA processes increase when the 
> nagios server is under load. They appear to be blocking on writing to 
> the command pipe. Switching NSCA to single daemon mode mitigates the 
> problem (slaves will time out their passive results), but we wanted to 
> know where any slowdowns could be.
> 
> From your findings, we've created a performance statistics patch, 
> attached. It collects the maximum and current values for the command 
> and service buffer slots and writes them to status.dat (by default 
> every 10 seconds). With a fake slave sending 128 results every 5 
> seconds, the maximum values were fairly low (under 100), but when I put 
> the server under load, maximum_command_buffer_items shot up to 1969 and 
> maximum_service_buffer_items shot up to 2156 (I had changed the slots 
> from the defaults to your 60000).
> 
> This could show if the buffer is filled at various points or if there is 
> not enough data ready for Nagios to process further down the chain.
> 
> I'd be interested in figures from other systems.
> 
> Warning: the patch is not thread safe, so there are no guarantees that 
> the statistics data will not be corrupted (but it should not affect 
> normal Nagios operation). Applies to Nagios 2.5. Tested on Debian with 
> a 2.6 kernel.
> 
> Ton
> 
> http://www.altinity.com
> T: +44 (0)870 787 9243
> F: +44 (0)845 280 1725
> Skype: tonvoon
> 
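
For anyone who wants to experiment with the current/maximum counters Ton 
describes without applying the patch, the idea can be sketched roughly 
as below. This is not the attached patch and not the Nagios buffer code. 
The slot_buffer type and its functions are hypothetical, and like the 
patch the sketch is not thread safe. It is a fixed-size ring of string 
pointers that records a high-water mark on every push.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical ring buffer of string pointers with usage statistics. */
typedef struct {
    char **slots;      /* pointer table, allocated once at init */
    size_t capacity;   /* number of slots */
    size_t head;       /* next slot to write */
    size_t tail;       /* next slot to read */
    size_t items;      /* entries currently queued */
    size_t max_items;  /* highest value `items` has reached */
} slot_buffer;

/* Returns 0 on success, -1 on bad capacity or allocation failure. */
int slot_buffer_init(slot_buffer *b, size_t capacity)
{
    assert(b != NULL);
    if (capacity == 0)
        return -1;
    b->slots = calloc(capacity, sizeof(char *));
    if (b->slots == NULL)
        return -1;
    b->capacity = capacity;
    b->head = b->tail = b->items = b->max_items = 0;
    return 0;
}

/* Queue a copy of `text`. Returns 0 on success, -1 if the buffer is
 * full (the writer would have to wait or drop) or memory runs out. */
int slot_buffer_push(slot_buffer *b, const char *text)
{
    if (b->items == b->capacity)
        return -1;
    size_t len = strlen(text) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, text, len);
    b->slots[b->head] = copy;
    b->head = (b->head + 1) % b->capacity;
    b->items++;
    if (b->items > b->max_items)
        b->max_items = b->items;   /* update the high-water mark */
    return 0;
}

/* Remove and return the oldest entry (caller frees), or NULL if empty. */
char *slot_buffer_pop(slot_buffer *b)
{
    if (b->items == 0)
        return NULL;
    char *text = b->slots[b->tail];
    b->slots[b->tail] = NULL;
    b->tail = (b->tail + 1) % b->capacity;
    b->items--;
    return text;
}

/* Release any queued entries and the pointer table itself. */
void slot_buffer_free(slot_buffer *b)
{
    char *text;
    while ((text = slot_buffer_pop(b)) != NULL)
        free(text);
    free(b->slots);
    b->slots = NULL;
    b->capacity = 0;
}
```

If max_items stays well below capacity under load, the buffer size is 
probably not the bottleneck. If it reaches capacity, pushes start 
failing, which would be consistent with writers blocking on the command 
pipe.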


Ethan Galstad,
Nagios Developer
---
Email: nagios at nagios.org
Website: http://www.nagios.org
