Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)

Marc Powell marc at ena.com
Sat Jan 20 19:10:25 CET 2007


This is an old issue, but I want to thank everyone who identified it and
got a fix into Nagios. We rarely have significant outages, but when we did
I would see a backlog of nagios processes (hundreds) with no passive check
results being processed. We had a network outage today, and with the new
patches I could see that we were hitting the Total Check Result Buffers
limit and adjust accordingly.

My problem is that, while the daemons no longer accumulate, the result
buffer isn't being processed. I have service_reaper_frequency set to 2 and
command_check_interval set to -1, but I don't see status updates for the
passive checks that are coming in. The nagiostats output is below. All my
passive checks are on a 5-minute interval and I can see them arriving, but
the output shows that Nagios hasn't processed any of the results in at
least 5 minutes. Any suggestions would be appreciated.

Nagios Stats 2.7
Copyright (c) 2003-2007 Ethan Galstad (www.nagios.org)
Last Modified: 01-19-2007
License: GPL

CURRENT STATUS DATA
----------------------------------------------------
Status File:                          /usr/local/nagios/var/status.dat
Status File Age:                      0d 0h 0m 33s
Status File Version:                  2.7

Program Running Time:                 0d 0h 14m 58s
Nagios PID:                           4208
Used/High/Total Command Buffers:      53 / 98 / 16384
Used/High/Total Check Result Buffers: 7733 / 7749 / 16384

Total Services:                       3935
Services Checked:                     3935
Services Scheduled:                   25
Active Service Checks:                25
Passive Service Checks:               3910
Total Service State Change:           0.000 / 17.570 / 1.069 %
Active Service Latency:               0.004 / 0.837 / 0.232 sec
Active Service Execution Time:        0.111 / 9.708 / 3.697 sec
Active Service State Change:          0.000 / 11.710 / 0.468 %
Active Services Last 1/5/15/60 min:   0 / 0 / 0 / 25
Passive Service State Change:         0.000 / 17.570 / 1.072 %
Passive Services Last 1/5/15/60 min:  0 / 0 / 3337 / 3910
Services Ok/Warn/Unk/Crit:            3456 / 2 / 7 / 470
Services Flapping:                    0
Services In Downtime:                 0

Total Hosts:                          2613
Hosts Checked:                        2613
Hosts Scheduled:                      0
Active Host Checks:                   2613
Passive Host Checks:                  0
Total Host State Change:              0.000 / 0.000 / 0.000 %
Active Host Latency:                  0.000 / 0.000 / 0.000 sec
Active Host Execution Time:           0.000 / 0.131 / 0.000 sec
Active Host State Change:             0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:      0 / 0 / 0 / 0
Passive Host State Change:            0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:     0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                2613 / 0 / 0
Hosts Flapping:                       0
Hosts In Downtime:                    0
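
For anyone who wants to watch the two buffer lines from output like the
above, a short script can pull them out. This is only a sketch: the function
name buffer_usage is made up for illustration, and it assumes the line
format printed by nagiostats 2.7 as shown in this message.

```python
import re

def buffer_usage(nagiostats_text):
    """Return {buffer name: (used, high, total)} parsed from nagiostats output."""
    pattern = re.compile(
        r"^Used/High/Total (.+?) Buffers:\s*(\d+)\s*/\s*(\d+)\s*/\s*(\d+)",
        re.MULTILINE,
    )
    return {
        m.group(1): (int(m.group(2)), int(m.group(3)), int(m.group(4)))
        for m in pattern.finditer(nagiostats_text)
    }

# The two buffer lines copied from the output in this message.
sample = """Used/High/Total Command Buffers:      53 / 98 / 16384
Used/High/Total Check Result Buffers: 7733 / 7749 / 16384
"""

for name, (used, high, total) in buffer_usage(sample).items():
    print(f"{name}: {used}/{total} slots used ({100 * used / total:.1f}%)")
```

Running this against successive nagiostats dumps would show whether the
Check Result "used" figure ever drops, which is the symptom described above.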

> -----Original Message-----
> From: nagios-devel-bounces at lists.sourceforge.net
> [mailto:nagios-devel-bounces at lists.sourceforge.net]
> On Behalf Of Mahesh Kunjal
> Sent: Monday, December 18, 2006 6:43 PM
> To: nagios-devel at lists.sourceforge.net; ae at op5.se;
> linux-system-technik at de.man-mn.com
> Subject: Re: [Nagios-devel] Problems with many hanging Nagios
> processes (Nagios spawning rogue nagios processes eventually crashing
> Nagios server)
> 
> 
> 
> We had a similar issue. We have a distributed environment with one
> master and 4 slaves. In total, 1900+ hosts and 20000+ services are
> monitored, spread across the 4 slaves.
>
> At times we saw 14K or more results being sent in a second from the
> slaves. This resulted in 100+ nagios processes being created.
> 
> Changed reaper frequency to 2 seconds and played with all tunables.
> Nothing seemed to help.
> 
> Looking at the nagios source, this is what I found was happening...
>
> Nagios has a command file worker thread. Each time it wakes up, it
> checks whether there is data in the pipe (nagios.cmd) and, if there is,
> forks a child process. The thread does this in a loop, checking the
> pipe for data on every pass.
> 
> Now what does the forked nagios child process do?
> It reads all the data from the pipe, one message at a time, and puts it
> in the command buffer. If it is able to write to the buffer, it just
> exits.
> 
> The problem here was that the command buffer had a limited size of 1024
> slots. This is the default setting in include/nagios.h.in, in the line
> #define COMMAND_BUFFER_SLOTS 1024.
> 
> This was not enough, and the child process started to wait for buffer
> space to be freed so that the data retrieved from the pipe could be put
> in the buffer.
>
> While this child process waited for space to be freed, the command
> worker thread woke up again, saw that there was data in the pipe, and
> forked another child. This repeated and eventually the server ran out
> of memory.
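
The pile-up described above can be illustrated with a toy model. This is not
Nagios code, just a sketch under simplifying assumptions: time moves in
ticks, one child is forked per tick while data keeps arriving, and the main
process drains a fixed number of slots per tick. The function simulate and
all of its parameters are hypothetical; only the slot counts 1024 and 60000
come from the quoted message.

```python
def simulate(ticks, arrivals_per_tick, slots, drained_per_tick):
    """Toy model: return the number of waiting child processes after each tick."""
    buffered = 0   # messages currently held in the command buffer
    waiting = []   # per blocked child: messages it still has to store
    history = []
    for _ in range(ticks):
        # The main process frees some buffer slots.
        buffered = max(0, buffered - drained_per_tick)
        # The worker thread sees data in the pipe and forks a child for it.
        if arrivals_per_tick > 0:
            waiting.append(arrivals_per_tick)
        # Each child stores what fits; it exits once nothing is left to store.
        still_waiting = []
        for pending in waiting:
            stored = min(pending, slots - buffered)
            buffered += stored
            if pending - stored > 0:
                still_waiting.append(pending - stored)
        waiting = still_waiting
        history.append(len(waiting))
    return history

small = simulate(ticks=5, arrivals_per_tick=2000, slots=1024, drained_per_tick=500)
large = simulate(ticks=5, arrivals_per_tick=2000, slots=60000, drained_per_tick=500)
print("1024 slots: ", small)
print("60000 slots:", large)
```

With 1024 slots the number of waiting children keeps growing, while with
60000 slots none wait during this short run. In this model arrivals still
outpace draining, so the larger buffer absorbs a burst rather than a
sustained overload.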
> 
> Here is what we did to resolve it.
>
> 1. Edit include/nagios.h.in and
> change
> #define COMMAND_BUFFER_SLOTS 1024
> to
> #define COMMAND_BUFFER_SLOTS 60000
> 
> And change
> #define SERVICE_BUFFER_SLOTS 1024
> to
> #define SERVICE_BUFFER_SLOTS 60000
> 
> 2. Run ./configure
> (make sure you don't have nanosecond sleep enabled, and also disable
> the Perl interpreter)
> 
> 3. make all;make install
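
The header edit in step 1 could also be scripted. The sketch below is a
hypothetical helper, not part of Nagios: set_buffer_slots takes the text of
include/nagios.h.in and rewrites only the two defines named above, assuming
they appear in the plain "#define NAME value" form quoted in this message.
The ./configure and make steps still have to be run afterwards.

```python
import re

def set_buffer_slots(header_text, slots):
    """Return header_text with both *_BUFFER_SLOTS defines set to `slots`."""
    pattern = re.compile(
        r"^(#define\s+(?:COMMAND|SERVICE)_BUFFER_SLOTS\s+)\d+",
        re.MULTILINE,
    )
    # A function replacement avoids ambiguity between the group
    # reference and the digits of the new value.
    return pattern.sub(lambda m: m.group(1) + str(slots), header_text)

# Stand-in for the relevant lines of include/nagios.h.in.
original = (
    "#define COMMAND_BUFFER_SLOTS 1024\n"
    "#define SERVICE_BUFFER_SLOTS 1024\n"
    "#define SOME_OTHER_SETTING 1024\n"  # made-up name; must stay untouched
)
patched = set_buffer_slots(original, 60000)
print(patched)
```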
> 
> 
> 
> 
> 
> - Mahesh Kunjal (maheshk)
> 
> -----------------------
> This thread is located in the archive at this URL:
> http://www.nagiosexchange.org/nagios-devel.33.0.html?&tx_maillisttofaq_pi1[showUid]=13177
> 
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
