Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)

Mahesh Kunjal mkunjal at gmail.com
Thu Dec 21 17:47:42 CET 2006


Hi Ton!

> Here is what we did to resolve.
>
> 1. Edit the include/nagios.h.in
> change
> #define COMMAND_BUFFER_SLOTS 1024
> to
> #define COMMAND_BUFFER_SLOTS 60000
>
> And change
> #define SERVICE_BUFFER_SLOTS 1024
> to
> #define SERVICE_BUFFER_SLOTS 60000
>
>
>
> I was intrigued by this as we have a performance issue, but not with the
> same symptoms. Our problem is that NSCA processes increase when the nagios
> server is under load. They appear to be blocking on writing to the command
> pipe. Switching NSCA to single daemon mode mitigates the problem (slaves will
> time out their passive results), but we wanted to know where any slowdowns
> could be.

We had NSCA-related performance issues too.
On the slaves, we started writing the results that need to be forwarded to
the master into a file.
Then, once every 10 or 15 seconds, we send that file over to the master.
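
A minimal sketch of that spool-and-batch idea, for illustration only. The
function names and file paths are hypothetical and are not part of NSCA; the
tab-separated line layout (host, service, return code, output) is assumed
here because it is the form send_nsca reads on standard input.

```c
/* Sketch only: spool passive check results locally, then ship them in
 * batches. All names here are hypothetical, not part of NSCA or Nagios. */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Append one result as a tab-separated line:
 * host, service, return code, plugin output.
 * Returns 0 on success, -1 on error or if a name contains the delimiter. */
int spool_append(const char *spool_path, const char *host,
                 const char *service, int return_code, const char *output)
{
    assert(spool_path && host && service && output);

    /* A tab inside a name would corrupt the line format, so refuse it. */
    if (strchr(host, '\t') != NULL || strchr(service, '\t') != NULL)
        return -1;

    FILE *fp = fopen(spool_path, "a");
    if (fp == NULL)
        return -1;

    int ok = fprintf(fp, "%s\t%s\t%d\t%s\n",
                     host, service, return_code, output) > 0;
    if (fclose(fp) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Move the spool aside so that new results start a fresh file.
 * Returns 0 if a batch is ready at batch_path, 1 if there was nothing
 * to send, and -1 on any other error. */
int spool_rotate(const char *spool_path, const char *batch_path)
{
    assert(spool_path && batch_path);

    if (rename(spool_path, batch_path) == 0)
        return 0;
    return (errno == ENOENT) ? 1 : -1;
}
```

A timer on the slave would call spool_rotate() every 10 or 15 seconds and,
when it returns 0, send the whole batch file to the master in one
connection. The point of the design is that the master sees one writer per
interval instead of one NSCA connection per check result.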



On 12/21/06, Ton Voon <ton.voon at altinity.com> wrote:
> Hi Mahesh,
>
>
> On 19 Dec 2006, at 00:42, Mahesh Kunjal wrote:
>
> Here is what we did to resolve.
>
> 1. Edit the include/nagios.h.in
> change
> #define COMMAND_BUFFER_SLOTS 1024
> to
> #define COMMAND_BUFFER_SLOTS 60000
>
> And change
> #define SERVICE_BUFFER_SLOTS 1024
> to
> #define SERVICE_BUFFER_SLOTS 60000
>
>
>
> I was intrigued by this as we have a performance issue, but not with the
> same symptoms. Our problem is that NSCA processes increase when the nagios
> server is under load. They appear to be blocking on writing to the command
> pipe. Switching NSCA to single daemon mode mitigates the problem (slaves will
> time out their passive results), but we wanted to know where any slowdowns
> could be.
>
> From your findings, we've created a performance statistics patch, attached.
> This collects the maximum and current values for the command and service
> buffer slots, which are then written to status.dat (by default every 10
> seconds). What
> I found with a fake slave sending 128 results every 5 seconds was that the
> maximum values were fairly low (under 100), but when I put the server under
> load, the maximum_command_buffer_items shot up to 1969 and the
> maximum_service_buffer_items shot up to 2156 (had changed from defaults to
> your 60000).
>
> This could show if the buffer is filled at various points or if there is not
> enough data ready for Nagios to process further down the chain.
>
> I'd be interested in figures from other systems.
>
> Warning: the patch is not thread safe, so there are no guarantees that the
> statistics data will not be corrupted (but it should not affect usual Nagios
> operation). Applies onto Nagios 2.5. Tested on Debian with a 2.6 kernel.
>
> Ton
>
> http://www.altinity.com
> T: +44 (0)870 787 9243
> F: +44 (0)845 280 1725
> Skype: tonvoon
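
For readers without the attachment, here is a rough sketch of what tracking
"current" and "maximum" buffer items can look like. This is not the code from
the patch: the struct, the function names, and the tiny SLOTS value are all
hypothetical, and rejecting a push when the buffer is full is a choice made
for this sketch, not a statement about how Nagios behaves. Like the patch,
it does no locking.

```c
/* Sketch only: a fixed-size ring buffer that records its current and
 * maximum fill level. Hypothetical names; not the attached patch. */
#include <assert.h>
#include <stddef.h>

#define SLOTS 8  /* illustrative; the thread raises the Nagios slots to 60000 */

struct ring {
    int    items[SLOTS];
    size_t head;       /* next slot to write */
    size_t tail;       /* next slot to read */
    size_t count;      /* current number of queued items */
    size_t max_count;  /* highest value count has ever reached */
};

/* Queue one item. Returns 0 on success, -1 if the buffer is full. */
int ring_push(struct ring *r, int item)
{
    assert(r != NULL);
    if (r->count == SLOTS)
        return -1;

    r->items[r->head] = item;
    r->head = (r->head + 1) % SLOTS;
    r->count++;

    /* Update the high-water mark after every successful push. */
    if (r->count > r->max_count)
        r->max_count = r->count;
    return 0;
}

/* Dequeue the oldest item. Returns 0 on success, -1 if the buffer is empty. */
int ring_pop(struct ring *r, int *item)
{
    assert(r != NULL && item != NULL);
    if (r->count == 0)
        return -1;

    *item = r->items[r->tail];
    r->tail = (r->tail + 1) % SLOTS;
    r->count--;
    return 0;
}
```

A maximum that sits well below the slot count suggests the consumer is
keeping up; a maximum at or near the slot count suggests the buffer is
filling faster than it is drained.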
