Memory leak

Arno Lehmann al at its-lehmann.de
Mon Apr 11 10:35:13 CEST 2005


As I can't see this mail in the archive, and Ian's question might be
related to this...
Once more.

Hello.

Well, I did something Really Very Stupid and sent the following to
myself, not to the list... OK, that happens, but I didn't even notice.
------------

As promised...

Arno Lehmann wrote:

>> Run it through valgrind and log everything. Post the logs on some
>> public webpage so users with little or no interest don't have to cope
>> with them on the list.
>
>
>
> Doing it just now... wait some time, and I'll post the URL.


http://www.lehleute.de/nagios-valgrind-err.txt

has the valgrind output. I understand only small parts of it...

This file is about 1.4 MB in size and resulted from a Nagios run with my
test config: 22 hosts, 22 services, checks by check_icmp and
check_dummy, no notifications, and a state retention file that exists
and (hopefully) matches the config. Nagios ran for only a few minutes
but still "ate" some memory.

Created with:
valgrind --tool=memcheck --leak-check=yes nagios-2.0b2/base/nagios \
  /usr/local/nagios/etc/nagios.cfg \
  >nagios-valgrind.txt 2>nagios-valgrind-err.txt
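For anyone who hasn't read memcheck output before, here is a minimal,
self-contained sketch of my own (toy code, nothing from the Nagios
sources) showing the kind of leak it reports as "definitely lost",
with the build and run steps in the comment:

/* leak-demo.c - a toy example, NOT Nagios code.
 *
 * Build and run, roughly like the command above:
 *   gcc -g -o leak-demo leak-demo.c
 *   valgrind --tool=memcheck --leak-check=yes ./leak-demo
 */
#include <stdlib.h>
#include <string.h>

int main(void)
{
    char *buf = malloc(64);       /* allocated... */
    strcpy(buf, "never freed");   /* ...used... */
    buf = NULL;                   /* ...and the last pointer dropped:
                                   * memcheck now reports 64 bytes
                                   * "definitely lost", with this
                                   * malloc() in the backtrace */
    return 0;
}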

Nagios is version 2.0b2 compiled from tarball:

> elf:~ # more nagios-2.0b2/config.status
> #! /bin/sh
> # Generated automatically by configure.
> # Run this file to recreate the current configuration.
> # This directory was configured as follows,
> # on host elf:
> #
> # ./configure


> elf:~ # valgrind --version
> valgrind-2.2.0


I hope someone can use this output to find the (possible) memory leaks.

Arno


------------

Well, time has passed and I have worked on the problem myself. I
installed 2.0b3 and kept everything the same except the binaries.

Now, memory usage still goes up when Nagios runs - about 220 MB within
hours - but then it stays at that level. About 10 hours after stopping
Nagios, the memory usage goes down again in "steps" to a "normal"
level. To me, this looks like some kernel memory issue - open sockets
with a long timeout before they're shut down, or something similar.
This is still with 22 hosts and 22 (more or less) dummy checks. I'm
now reactivating the normal service checks one by one...

Anyway, something inside Nagios has obviously changed, and it has an
effect on Nagios' or the kernel's memory consumption.

Arno

Andreas Ericsson wrote:

> Arno Lehmann wrote:
> 
>> Hi.
>>
>> Andreas Ericsson wrote:
>>
>>> The kernel uses memory, and most OSes implement copy-on-write for
>>> forked processes (Linux does this, and judging by the apps running,
>>> that's what you're using). That means only changed frames are
>>> actually copied on a fork(), but the theoretical maximum consumption
>>> (as determined by allocated buffers in the master process) is
>>> displayed anyway.
>>
>>
>>
>> Errm - sure. Anyway, what I see is that the memory claimed by 
>> processes is far less than what the kernel says is used.
>>
> 
> This is because free and friends show what's available to a program 
> running on the system. Removed from that pool is memory hogged by 
> graphics drivers that shadow RAM, and the kernel's own memory. Large 
> routing tables, software RAID and stateful in-kernel firewalls are three 
> of the most common causes of "disappearing" memory. If Nagios had a 
> leak, its process size would grow abnormally, and most likely fairly 
> rapidly. In short, memory wouldn't be "missing"; it would be assigned to 
> a process that usually doesn't claim that much of it.
> 
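To illustrate the copy-on-write point above, a small Linux-only sketch
of my own (nothing Nagios-specific; MemFree readings are noisy, so
treat the numbers as rough): ps and top show the child at full size
right after fork(), yet system-wide free memory barely moves until the
child actually writes to the inherited buffer.

/* cow-demo.c - Linux-only sketch of copy-on-write after fork(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define SZ (64 * 1024 * 1024)               /* a 64 MB buffer */

static long mem_free_kb(void)               /* MemFree from /proc/meminfo */
{
    char line[128];
    long kb = -1;
    FILE *f = fopen("/proc/meminfo", "r");
    while (f && fgets(line, sizeof line, f))
        if (sscanf(line, "MemFree: %ld kB", &kb) == 1)
            break;
    if (f)
        fclose(f);
    return kb;
}

int main(void)
{
    char *buf = malloc(SZ);
    memset(buf, 0xAA, SZ);                  /* parent touches every page */

    long before = mem_free_kb();
    pid_t pid = fork();
    if (pid == 0) {
        /* ps/top already show the child at full size here, but the
         * pages are still shared with the parent: */
        printf("fork() cost: ~%ld kB\n", before - mem_free_kb());
        memset(buf, 0x55, SZ);              /* writing forces real copies */
        printf("after write: ~%ld kB\n", before - mem_free_kb());
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    free(buf);
    return 0;
}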
>>>> Any other ideas?
>>>>
>>>
>>> Run it through valgrind and log everything. Post the logs on some 
>>> public webpage so users with little or no interest don't have to 
>>> cope with them on the list.
>>
>>
>>
>> Doing it just now... wait some time, and I'll post the URL.
>>
> 
> Excellent.
> 
>> One question, though:
>> I get output like the following
>>
>>> ==30154== Syscall param socketcall.sendto(msg) contains uninitialised 
>>> or unaddressable byte(s)
>>> ==30154==    at 0x1BA4A4E1: sendto (in /lib/tls/libc.so.6)
>>> ==30154==    by 0x1BA33FB6: getaddrinfo (in /lib/tls/libc.so.6)
>>> ==30154==    by 0x1BC00521: ldap_connect_to_host (in 
>>> /usr/lib/libldap-2.2.so.7.0.8)
>>> ==30154==    by 0x1BBEACDC: ldap_int_open_connection (in 
>>> /usr/lib/libldap-2.2.so.7.0.8)
>>> ==30154==  Address 0x52BFD07D is on thread 1's stack
>>> Nagios 2.0b2 starting... (PID=30154)
>>> ==30160==
>>> ==30160== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 41 
>>> from 1)
>>> ==30160== malloc/free: in use at exit: 1250742 bytes in 122 blocks.
>>> ==30160== malloc/free: 17115 allocs, 16993 frees, 2215553 bytes 
>>> allocated.
>>> ==30160== For counts of detected errors, rerun with: -v
>>> ==30160== searching for pointers to 122 not-freed blocks.
>>> ==30158==
>>> ==30158== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 41 
>>> from 1)
>>> ==30158== malloc/free: in use at exit: 1250742 bytes in 122 blocks.
>>> ==30158== malloc/free: 17146 allocs, 17024 frees, 2215788 bytes 
>>> allocated.
>>> ==30158== For counts of detected errors, rerun with: -v
>>> ==30158== searching for pointers to 122 not-freed blocks.
>>> ==30160== checked 2197636 bytes.
>>> ==30160==
>>> ==30160==
>>> ==30160== 8 bytes in 2 blocks are definitely lost in loss record 1 of 18
>>> ==30160==    at 0x1B903BAC: malloc (in 
>>> /usr/lib/valgrind/vgpreload_memcheck.so)
>>> ==30160==    by 0x1B8E9F5E: _dl_map_object_from_fd (in /lib/ld-2.3.3.so)
>>> ==30160==    by 0x1B8EACC9: _dl_map_object (in /lib/ld-2.3.3.so)
>>> ==30160==    by 0x1B8F09CD: openaux (in /lib/ld-2.3.3.so)
>>> ==30160==
>>> ==30160==
>>> ==30160== 37 bytes in 2 blocks are definitely lost in loss record 4 
>>> of 18
>>> ==30160==    at 0x1B903BAC: malloc (in 
>>> /usr/lib/valgrind/vgpreload_memcheck.so)
>>> ==30160==    by 0x1B9F7CAF: strdup (in /lib/tls/libc.so.6)
>>> ==30160==    by 0x807753E: add_host_notification_command_to_contact 
>>> (objects.c:2465)
>>> ==30160==    by 0x8084C95: xodtemplate_register_contact 
>>> (xodtemplate.c:7800)
>>> ==30160==
>>> ==30160==
>>> ==30160== 41 bytes in 2 blocks are definitely lost in loss record 5 
>>> of 18
>>> ==30160==    at 0x1B903BAC: malloc (in 
>>> /usr/lib/valgrind/vgpreload_memcheck.so)
>>> ==30160==    by 0x1BC1D900: ???
>>> ==30160==    by 0x1BC1DA58: ???
>>> ==30160==    by 0x1BC05149: ???
>>
>>
>>
>> The last block contains addresses, but not code lines. Is that normal?
> 
> 
> Yes. It happens whenever the EIP enters a library that hasn't got any 
> debug symbols, or if the binary is stripped and you don't have a symbol 
> table to load into valgrind (you need to save the symbol table *before* 
> stripping for valgrind to be able to use it).
> 
>> I assume that's kernel space, but I'm not sure about anything - 
>> valgrind's output is quite cryptic to me. Above, I have the code lines 
>> and function names.
>>
> 
> Kernel space doesn't have debug symbols attached, of course, so that 
> could be it.
> 
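By the way, regarding the two records above that point into objects.c:
without having checked the actual Nagios code, a strdup() at the top of
a "definitely lost" backtrace usually means the pattern below (all
names made up for illustration) - the containing structure gets freed,
but the duplicated string inside it is forgotten:

/* strdup-leak.c - sketch of the usual strdup() leak pattern.
 * All names here are hypothetical, NOT the real objects.c code. */
#include <stdlib.h>
#include <string.h>

struct command_ref {
    char *name;                   /* strdup()'d copy of the name */
    struct command_ref *next;
};

static struct command_ref *cmd_list;

static void add_command(const char *name)
{
    struct command_ref *c = malloc(sizeof *c);
    c->name = strdup(name);       /* memcheck points at this strdup()... */
    c->next = cmd_list;
    cmd_list = c;
}

static void free_commands(void)
{
    while (cmd_list) {
        struct command_ref *next = cmd_list->next;
        /* ...because the node is freed but node->name is forgotten,
         * leaving the string unreachable: "definitely lost" */
        free(cmd_list);
        cmd_list = next;
    }
}

int main(void)
{
    add_command("notify-by-email");
    free_commands();
    return 0;
}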
>> Arno
>>
>>>> Arno
>>>>
>>>
>>
> 

-- 
IT-Service Lehmann                    al at its-lehmann.de
Arno Lehmann                  http://www.its-lehmann.de
