Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)

Mahesh Kunjal mkunjal at gmail.com
Tue Dec 19 01:34:32 CET 2006


---------- Forwarded message ----------
From: Mahesh Kunjal <mkunjal at gmail.com>
Date: Dec 18, 2006 2:58 PM
Subject: Re: [Nagios-devel] Problems with many hanging Nagios
processes (Nagios spawning rogue nagios processes eventually crashing
Nagios server)
To: nagios-devel at lists.sourceforge.net, mkunjal at gmail.com


Hi Andreas

We had a similar issue. We have a distributed environment with one
master and 4 slaves. In total we monitor 1900+ hosts and 20000+
services, spread across the 4 slaves.

At times the slaves sent 14K or more results in a single second.
This resulted in 100+ nagios processes being created.

We changed the reaper frequency to 2 seconds and experimented with all
the tunables. Nothing seemed to help.

Looking at the Nagios source, this is what I found was happening:

Nagios has a command file worker thread. When it is woken up, it
checks whether there is data in the pipe (nagios.cmd) and, if so,
forks a child process. It does this in a loop, checking the pipe for
data each time.
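The check-and-fork step described above can be sketched with POSIX poll() and fork(). This is a minimal model of the behaviour as described in this post, not code taken from the Nagios source; command_pipe_ready(), worker_wakeup() and discard_pending() are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <poll.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return 1 if the command pipe has data waiting, 0 if not, -1 on error.
   A timeout of 0 makes poll() check without blocking. */
static int command_pipe_ready(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
    int rc = poll(&pfd, 1, 0);
    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/* Hypothetical reader run in the child: consume whatever bytes are queued. */
static void discard_pending(int fd)
{
    char buf[4096];
    ssize_t n = read(fd, buf, sizeof buf);
    (void)n;
}

/* One wakeup of the worker as the post describes it (assumed model, not
   Nagios code): if the pipe has data, fork a child to read it.
   Returns the child's pid, 0 if the pipe was empty, or -1 on error.
   Nothing here checks whether an earlier child is still running. */
static pid_t worker_wakeup(int fd, void (*reader)(int))
{
    int ready = command_pipe_ready(fd);
    if (ready <= 0)
        return ready;
    pid_t pid = fork();
    if (pid == 0) {          /* child: read from the pipe, then exit */
        reader(fd);
        _exit(0);
    }
    return pid;              /* parent: child's pid, or -1 if fork() failed */
}
```

Because worker_wakeup() never asks whether a previous child is still busy, every wakeup that finds data in the pipe adds one more process.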

What does the forked Nagios child process do? It reads the data from
the pipe one message at a time and puts each message into the command
buffer. If it is able to write everything to the buffer, it simply
exits.

The problem was that the command buffer had a limited size of 1024
slots. This is the default setting in include/nagios.h.in, on the line
#define COMMAND_BUFFER_SLOTS 1024

This was not enough, so the child process started to wait for space
in the buffer to be freed so that the data it had read from the pipe
could be put into the buffer.
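A fixed-size circular buffer makes the limit concrete: once all COMMAND_BUFFER_SLOTS slots hold unprocessed commands, nothing more can be queued until the main loop takes one out. The sketch below assumes a simple array of strdup()'d strings; struct command_buffer, buffer_submit() and buffer_take() are hypothetical, and the real Nagios buffer may be laid out differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define COMMAND_BUFFER_SLOTS 1024   /* default value quoted from include/nagios.h.in */

/* Hypothetical fixed-size circular buffer of command strings. */
struct command_buffer {
    char *slot[COMMAND_BUFFER_SLOTS];
    int head;    /* index of the oldest queued command */
    int tail;    /* index where the next command is stored */
    int items;   /* number of occupied slots */
};

/* Queue a copy of cmd. Returns 1 on success, 0 if every slot is taken
   (the point where, per the post, the reading child starts waiting),
   or -1 if memory for the copy could not be allocated. */
static int buffer_submit(struct command_buffer *b, const char *cmd)
{
    if (b->items == COMMAND_BUFFER_SLOTS)
        return 0;
    char *copy = strdup(cmd);
    if (copy == NULL)
        return -1;
    b->slot[b->tail] = copy;
    b->tail = (b->tail + 1) % COMMAND_BUFFER_SLOTS;
    b->items++;
    return 1;
}

/* Remove and return the oldest command (caller frees it), or NULL if empty. */
static char *buffer_take(struct command_buffer *b)
{
    if (b->items == 0)
        return NULL;
    char *cmd = b->slot[b->head];
    b->head = (b->head + 1) % COMMAND_BUFFER_SLOTS;
    b->items--;
    return cmd;
}
```

Once buffer_submit() returns 0, the only way forward is for the consumer to call buffer_take(); a reader that insists on queuing its data has to wait until that happens.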

While this child process waited, the command file worker thread was
woken up again, saw that there was still data in the pipe, and forked
another child. This repeated until the server eventually ran out of
memory.
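The runaway can be reproduced with a toy tick-based model. Everything here is an assumption for illustration: the main loop is taken to drain a fixed number of commands per tick, one reader is forked per tick while the pipe still holds data, and readers exit only once the pipe is empty. peak_blocked_readers() is a hypothetical helper, not part of Nagios.

```c
#include <assert.h>

/* Toy model (assumption, not Nagios code) of the feedback loop above.
   Each tick: new results arrive in the pipe, the main loop drains up to
   drain_per_tick commands from the buffer, the worker forks a reader if
   the pipe has data, and readers move as much as fits into the buffer.
   Readers exit once the pipe is empty; otherwise they stay blocked.
   Returns the largest number of readers still blocked at the end of a tick. */
static int peak_blocked_readers(int slots, int drain_per_tick,
                                const int *arrivals, int ticks)
{
    long in_pipe = 0, used = 0;
    int blocked = 0, peak = 0;
    for (int t = 0; t < ticks; t++) {
        in_pipe += arrivals[t];
        used -= (used < drain_per_tick) ? used : drain_per_tick;
        if (in_pipe > 0)
            blocked++;                 /* worker sees data, forks another reader */
        long free_slots = slots - used;
        long moved = (in_pipe < free_slots) ? in_pipe : free_slots;
        in_pipe -= moved;
        used += moved;
        if (in_pipe == 0)
            blocked = 0;               /* everything fitted: readers exit */
        if (blocked > peak)
            peak = blocked;
    }
    return peak;
}
```

With a single burst of 14000 results and a made-up drain rate of 1000 commands per tick, 1024 slots leave up to 13 readers blocked at once, while 60000 slots absorb the whole burst. In the same model, if results keep arriving faster than they are drained, readers still pile up once the larger buffer fills, so extra slots buy headroom for bursts rather than unlimited capacity.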

Here is what we did to resolve it:

1. Edit include/nagios.h.in and change
#define COMMAND_BUFFER_SLOTS 1024
to
#define COMMAND_BUFFER_SLOTS 60000

And change
#define SERVICE_BUFFER_SLOTS 1024
to
#define SERVICE_BUFFER_SLOTS 60000

2. Run ./configure
(make sure nanosecond sleep is not enabled, and also disable the
embedded Perl interpreter)

3. make all; make install




- Mahesh Kunjal (maheshk)

-----------------------
This thread is located in the archive at this URL:
http://www.nagiosexchange.org/nagios-devel.33.0.html?&tx_maillisttofaq_pi1[showUid]=13177
