[naemon-users] Error: Failed to allocate memory in nm_malloc

andre.weidemann at deutschebahn.com
Tue Sep 8 10:59:21 CEST 2015


Hi,

Last week I cloned naemon from GitHub and compiled it. I set it up with 
gearmand and mod_gearman workers, with a minimum of 600 workers and a 
maximum of 1000. Our naemon setup contains ~3800 servers with ~73000 
checks (see stats at the end). The status.dat file resides on a 256 MB 
RAM disk, which is 42% full.
Naemon runs with Thruk as a frontend. Everything works smoothly except 
for one problem: after about 2 days naemon crashes because it cannot 
allocate any more memory.


Here is our setup:

I configured naemon like this:
git clone https://github.com/naemon/naemon.git
cd naemon
./configure --prefix=/.../naemon --with-naemon-user=someuser --with-naemon-group=somegroup

It is running on a 64-bit SLES11-SP3 machine with 24 GB RAM and 2x Intel(R) 
Xeon(R) E5506.

We are using gearmand-1.1.12 and mod_gearman 2.1.3 (git). Both were 
compiled from source.

gearmand-1.1.12:
./configure
make -j 12
make install

mod_gearman (2.1.3/git clone):
git clone https://github.com/sni/mod_gearman.git
cd mod_gearman
./configure --prefix=/.../naemon --with-user=someuser
make -j 12
make install


Saturday night naemon ran out of memory. According to "sar -r" it looked 
like this right before the crash:
22:50:02 149888 24314448 99.39 3660 398776 11191480 34.07

After naemon crashed with:

[1440881992] Error: Failed to allocate memory in nm_malloc

all memory was freed again:

23:00:01 22620248 1844088 7.54 4180 330768 2874272 8.75
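
To make the two sar samples above easier to read, here is a minimal Python sketch that labels the values. The column names are an assumption: the header line is not part of the quoted output, and the names used here are the seven-column layout printed by older sysstat releases. The function name is illustrative only.

```python
# Assumed column order for `sar -r` on older sysstat releases; the header
# line is not part of the output quoted above.
SAR_COLUMNS = ("kbmemfree", "kbmemused", "%memused",
               "kbbuffers", "kbcached", "kbcommit", "%commit")


def parse_sar_line(line):
    """Split one `sar -r` sample into (time, {column: value})."""
    fields = line.split()
    timestamp, values = fields[0], fields[1:]
    if len(values) != len(SAR_COLUMNS):
        raise ValueError("unexpected number of columns: %d" % len(values))
    return timestamp, dict(zip(SAR_COLUMNS, (float(v) for v in values)))


# The two samples quoted above: right before and right after the crash.
before = parse_sar_line(
    "22:50:02 149888 24314448 99.39 3660 398776 11191480 34.07")
after = parse_sar_line(
    "23:00:01 22620248 1844088 7.54 4180 330768 2874272 8.75")

# Difference in free memory between the two samples, in KiB.
freed_kb = after[1]["kbmemfree"] - before[1]["kbmemfree"]
print("free memory gained after the crash: %.1f GiB"
      % (freed_kb / 1024.0 / 1024.0))
```

If the assumed layout is right, almost all of the used memory was released when the naemon process died, which points at the process itself rather than at the page cache.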

naemon.cfg contains two broker_modules:

broker_module=/.../naemon/lib/mod_gearman2/mod_gearman2.o keyfile=/.../naemon/etc/secret.txt server=localhost:4730 eventhandler=yes hosts=yes services=yes config=/.../naemon/etc/mod_gearman2/module.conf

broker_module=/.../naemon/lib/naemon-livestatus/livestatus.so /.../naemon/var/cache/naemon/live

Additionally I set:
use_large_installation_tweaks=1
daemon_dumps_core=1


The tweak option does not seem to have any effect on memory 
consumption.
Unfortunately, no core dumps are generated even though I set:
ulimit -c unlimited

I even created a separate filesystem for the core dump and pointed to it:
echo '/.../coredump/core_%e.%p' > /proc/sys/kernel/core_pattern
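
One thing that may be worth ruling out: "ulimit -c" only applies to the shell it is run in and to processes started from that shell, so a daemon started some other way can still have a core limit of 0. The following Python sketch reads the limit that is actually in effect for a running process. It assumes a Linux /proc filesystem; the function names are illustrative, and this is only one possible reason for the missing core files.

```python
def parse_core_limit(limits_text):
    """Return (soft, hard) core file size limits from /proc/<pid>/limits text.

    Values are returned as strings, e.g. "0" or "unlimited".
    Returns None if the expected line is not present.
    """
    label = "Max core file size"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    return None


def read_core_limit(pid):
    """Read the core file size limits of a running process (Linux only)."""
    with open("/proc/%d/limits" % pid) as fh:
        return parse_core_limit(fh.read())
```

For example, read_core_limit(1343) would report the limits of the naemon process whose PID appears in the stats at the end of this mail. A soft limit of "0" would mean that the kernel writes no core file for that process, regardless of core_pattern.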


Running watch "ps -eo pid,pmem,rss,vsz,comm|sort -rn -k3|head -1"
clearly shows that the RSS of the main naemon process is constantly growing.
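
Since watch only shows the current value, a timestamped log of the RSS may be more useful for matching growth against entries in naemon.log. Below is a minimal Python sketch of such a sampler. It assumes a Linux /proc filesystem; the function names, interval and sample count are illustrative choices, not anything naemon provides.

```python
import sys
import time


def parse_vmrss_kb(status_text):
    """Return the VmRSS value in kB from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # some processes (e.g. kernel threads) have no VmRSS line


def read_rss_kb(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open("/proc/%d/status" % pid) as fh:
        return parse_vmrss_kb(fh.read())


def log_rss(pid, interval_s=300, samples=12, out=sys.stdout):
    """Write one 'unix_time rss_kb' line per interval to `out`."""
    for i in range(samples):
        out.write("%d %s\n" % (time.time(), read_rss_kb(pid)))
        out.flush()
        if i + 1 < samples:
            time.sleep(interval_s)
```

Calling log_rss(1343) would record one sample every five minutes for an hour. The Unix timestamps use the same format as the bracketed timestamps in naemon.log, so the two can be compared directly.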

After seven and a half hours naemon had grown to more than 3 GB of memory:
1343 14.9 3659236 3784332 naemon

After 1 day and 14 hours it had grown to over 18 GB:
1343 78.1 19108300 19259592 naemon
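
As a rough cross-check, the two ps samples above can be turned into a growth rate and extrapolated. This Python sketch assumes the growth is linear, takes the elapsed times as 7.5 and 38 hours, and uses kbmemfree + kbmemused from the first sar sample (149888 + 24314448) as the total memory. It ignores swap and all other processes, so treat the result as an estimate only.

```python
# (elapsed hours, RSS in KiB) from the two ps samples above
SAMPLES = [(7.5, 3659236), (38.0, 19108300)]

# Total memory in KiB, taken as kbmemfree + kbmemused from the sar sample.
TOTAL_KB = 149888 + 24314448


def growth_rate_kb_per_hour(samples):
    """Average RSS growth between the first and last sample."""
    (t0, rss0), (t1, rss1) = samples[0], samples[-1]
    return (rss1 - rss0) / float(t1 - t0)


def hours_until(limit_kb, samples):
    """Elapsed hours at which RSS reaches limit_kb, assuming linear growth."""
    t_last, rss_last = samples[-1]
    return t_last + (limit_kb - rss_last) / growth_rate_kb_per_hour(samples)


rate = growth_rate_kb_per_hour(SAMPLES)
print("growth: %.0f MiB/hour" % (rate / 1024.0))
print("memory exhausted after about %.1f hours" % hours_until(TOTAL_KB, SAMPLES))
```

Under these assumptions the process gains a little under 500 MiB per hour and would fill the machine roughly two days after start, which matches the crash interval described at the top of this mail.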


How can I find out what is causing it to grow so big?


Here is some additional info:


naemon git: -> git show:

commit de1e51acf0cf06b352754ed71fbb332abf86fc4e
Merge: 36b3c16 d84d321
Author: Sven Nierlein <sven at nierlein.org>
Date: Mon Aug 17 13:02:18 2015 +0200

Merge pull request #47 from glensc/patch-1

Update README.md

mod_gearman: -> git show:

commit b6e4a41f78a5cf0f030a8fdb894eab1bfaca47c1
Author: Sven Nierlein <Sven.Nierlein at consol.de>
Date: Wed Jul 1 13:58:22 2015 +0200

release 2.1.3

Stats:
CURRENT STATUS DATA
------------------------------------------------------
Status File: /.../naemon/var/ramdisk/status.dat
Status File Age: 0d 0h 0m 9s
Status File Version: 1.0.3-g250db6c

Program Running Time: 0d 7h 30m 25s
Naemon PID: 1343

Total Services: 73259
Services Checked: 73259
Services Scheduled: 73255
Services Actively Checked: 73259
Services Passively Checked: 0
Total Service State Change: 0.000 / 9.610 / 0.009 %
Active Service Latency: 0.002 / 314.173 / 1.887 sec
Active Service Execution Time: 0.006 / 153.002 / 0.615 sec
Active Service State Change: 0.000 / 9.610 / 0.009 %
Active Services Last 1/5/15/60 min: 10008 / 65174 / 67458 / 67530
Passive Service Latency: 0.000 / 0.000 / 0.000 sec
Passive Service State Change: 0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min: 0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit: 72901 / 58 / 3 / 297
Services Flapping: 0
Services In Downtime: 0

Total Hosts: 3782
Hosts Checked: 3782
Hosts Scheduled: 3782
Hosts Actively Checked: 3782
Host Passively Checked: 0
Total Host State Change: 0.000 / 0.000 / 0.000 %
Active Host Latency: 0.621 / 3.429 / 0.910 sec
Active Host Execution Time: 0.007 / 0.323 / 0.042 sec
Active Host State Change: 0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min: 535 / 3714 / 3782 / 3782
Passive Host Latency: 0.000 / 0.000 / 0.000 sec
Passive Host State Change: 0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min: 0 / 0 / 0 / 0
Hosts Up/Down/Unreach: 3782 / 0 / 0
Hosts Flapping: 0
Hosts In Downtime: 0

Active Host Checks Last 1/5/15 min: 2 / 3 / 13
Scheduled: 0 / 0 / 0
On-demand: 2 / 3 / 13
Parallel: 0 / 0 / 0
Serial: 0 / 0 / 0
Cached: 2 / 3 / 13
Passive Host Checks Last 1/5/15 min: 0 / 0 / 0
Active Service Checks Last 1/5/15 min: 0 / 0 / 0
Scheduled: 0 / 0 / 0
On-demand: 0 / 0 / 0
Cached: 0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min: 0 / 0 / 0

This is how naemon.log looks right after the start:

[1441699986] Naemon 1.0.3-g250db6c starting... (PID=14922)
[1441699986] Local time is Tue Sep 08 10:13:06 CEST 2015
[1441699986] LOG VERSION: 2.0
[1441699986] qh: Socket '/.../naemon/var/naemon.qh' successfully initialized
[1441699986] qh: core query handler registered
[1441699986] nerd: Channel hostchecks registered successfully
[1441699986] nerd: Channel servicechecks registered successfully
[1441699986] nerd: Channel opathchecks registered successfully
[1441699986] nerd: Fully initialized and ready to rock!
[1441699986] wproc: Successfully registered manager as @wproc with query handler
[1441699986] wproc: Registry request: name=Core Worker 14924;pid=14924
[1441699986] wproc: Registry request: name=Core Worker 14925;pid=14925
[1441699986] wproc: Registry request: name=Core Worker 14926;pid=14926
[1441699986] wproc: Registry request: name=Core Worker 14927;pid=14927
[1441699986] wproc: Registry request: name=Core Worker 14928;pid=14928
[1441699986] wproc: Registry request: name=Core Worker 14929;pid=14929
[1441699986] wproc: Registry request: name=Core Worker 14931;pid=14931
[1441699986] wproc: Registry request: name=Core Worker 14932;pid=14932
[1441699986] wproc: Registry request: name=Core Worker 14935;pid=14935
[1441699986] wproc: Registry request: name=Core Worker 14933;pid=14933
[1441699986] wproc: Registry request: name=Core Worker 14934;pid=14934
[1441699986] wproc: Registry request: name=Core Worker 14930;pid=14930
[1441699986] livestatus: Naemon Livestatus 1.0.3-naemon Socket: '/.../naemon/var/cache/naemon/live'
[1441699986] livestatus: Finished initialization. Further log messages go to /.../naemon/var/log/naemon/livestatus.log
[1441699986] Event broker module '/.../naemon/lib/naemon-livestatus/livestatus.so' initialized successfully.
[1441699986] mod_gearman: initialized version 2.1.3 (libgearman 1.1.12)
[1441699986] Event broker module '/.../naemon/lib/mod_gearman2/mod_gearman2.o' initialized successfully.


Thank you very much in advance


André Weidemann
Data Center Services / Platform Unix Operations (I.LPD 73)

DB Systel GmbH
Schlachthofstraße 80, 99085 Erfurt
Tel. +49 (0)361-300-5640, intern 980-5640, Fax (0)361-300-5981
Mobil: 0160 97442245