Memory leak

Arno Lehmann al at its-lehmann.de
Mon May 16 19:34:44 CEST 2005


Hi,

first, please excuse the crosspost: I asked on the users list some time 
ago, but without a useful result.

Now, I've got a problem with Nagios 2.0b3 (with b2 as well, but I'm 
trying b3 at the moment).

I noticed that the amount of used memory rises without end when Nagios 
runs. I was able to find out the following:
- If I've got only one host and one service, memory usage stays constant.
- As soon as I add a second service, it goes up.
- The more services or hosts, or the higher the check frequency, the 
faster the memory usage rises.
- No tool I know (like top or ps) can tell me where the memory goes (or 
rather, which process it's used by).
- The memory usage does not go down as soon as I kill the Nagios 
process; it can take anywhere from a few hours until the next reboot. 
If I start a process that requests more memory than is physically 
available, i.e. I force the system to swap, it gets freed.
- If I simply let the system run, the kernel out-of-memory reaper starts 
killing processes, though.
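
One way to narrow down where memory goes when no process accounts for it is to sample /proc/meminfo at intervals and compare the fields. The following is a minimal sketch, not a diagnosis: it assumes a Linux /proc/meminfo that reports MemTotal, MemFree, Buffers, Cached and Slab in kB, and the function names are hypothetical helpers made up for this illustration.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def used_excluding_cache_kb(info):
    """Memory in use that is neither free nor buffers/page cache (kB)."""
    return (info["MemTotal"] - info["MemFree"]
            - info.get("Buffers", 0) - info.get("Cached", 0))


def log_samples(count, delay_seconds, path="/proc/meminfo"):
    """Print `count` samples, `delay_seconds` apart (assumes `path` exists)."""
    for _ in range(count):
        with open(path) as handle:
            info = parse_meminfo(handle.read())
        print(time.strftime("%H:%M:%S"),
              "used_excl_cache_kB=%d" % used_excluding_cache_kb(info),
              "slab_kB=%d" % info.get("Slab", 0))
        time.sleep(delay_seconds)
```

Calling, for example, log_samples(60, 60) while Nagios runs would give one line per minute for an hour. If the Slab figure grows along with the overall usage, that would point towards kernel-side allocations rather than a user-space process, but this is only one possible reading of the numbers.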

I've got the following system:
> elf:~ # uname -a
> Linux elf 2.6.8-24.14-default #1 Tue Mar 29 09:27:43 UTC 2005 i686 athlon i386 GNU/Linux

> elf:/usr/local/nagios # bin/nagios  etc/nagios-mini.cfg
> 
> Nagios 2.0b3
> Copyright (c) 1999-2005 Ethan Galstad (www.nagios.org)
> Last Modified: 04-03-2005
> License: GPL
> 
> Nagios 2.0b3 starting... (PID=20489)

> elf:~ # ldd /usr/local/nagios/bin/nagios
>         linux-gate.so.1 =>  (0xffffe000)
>         libm.so.6 => /lib/tls/libm.so.6 (0x4002e000)
>         libnsl.so.1 => /lib/libnsl.so.1 (0x40051000)
>         libpthread.so.0 => /lib/tls/libpthread.so.0 (0x40068000)
>         libltdl.so.3 => /usr/lib/libltdl.so.3 (0x4007a000)
>         libc.so.6 => /lib/tls/libc.so.6 (0x40081000)
>         /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)
>         libdl.so.2 => /lib/libdl.so.2 (0x40197000)
so, no embedded Perl, I guess.

The system is a 500MHz athlon, 512 MB RAM, IDE disk which serves as an 
all-purpose-server and works fine, so I'm quite sure the OS and the 
hardware are more or less ok.

You'll find the configuration I use for testing below.

Now, I assume there is some sort of memory leak, either in Nagios itself 
or in the kernel.

I don't think it's the plugins: I tried several, including some simple 
shell scripts like 'echo OK; exit 0', and I verified them using 
valgrind.

Running Nagios under valgrind, I do get lots of output. Unfortunately, 
I'm not a programmer, so it is more or less impossible for me to 
interpret it.

Since Nagios is a very useful project and I've had good experiences 
with version 1.x, I'd really like to be able to upgrade to version 2, 
as well as help get it running on a wider range of systems.

Now, I assume that version 2.0b usually runs ok, because I see no other 
problem reports. I'm wondering if anyone can give me some advice on how 
to solve these problems.

Of course, I can supply log files etc. or do test runs with different 
configurations.

Arno

----------
Here's my current configuration:

> elf:~ # cat /usr/local/nagios/etc/nagios-mini.cfg
> log_file=/usr/local/nagios/var/nagios.log
> cfg_file=/usr/local/nagios/etc/mini.cfg
> object_cache_file=/usr/local/nagios/var/objects.cache
> resource_file=/usr/local/nagios/etc/resource.cfg
> status_file=/usr/local/nagios/var/status.dat
> nagios_user=nagios
> nagios_group=nagios
> command_check_interval=30s
> command_file=/usr/local/nagios/var/rw/nagios-test.cmd
> comment_file=/usr/local/nagios/var/comments.dat
> downtime_file=/usr/local/nagios/var/downtime.dat
> lock_file=/usr/local/nagios/var/nagios.lock
> temp_file=/usr/local/nagios/var/nagios.tmp
> log_rotation_method=d
> log_archive_path=/usr/local/nagios/var/archives
> use_syslog=0
> log_notifications=1
> log_service_retries=1
> log_host_retries=1
> log_event_handlers=1
> log_initial_states=1
> log_external_commands=1
> log_passive_checks=1
> service_inter_check_delay_method=s
> max_service_check_spread=60
> service_interleave_factor=s
> host_inter_check_delay_method=s
> max_host_check_spread=60
> max_concurrent_checks=20
> service_reaper_frequency=2
> service_check_timeout=30
> host_check_timeout=60
> event_handler_timeout=30
> notification_timeout=60
> ocsp_timeout=5
> perfdata_timeout=5
> retain_state_information=1
> state_retention_file=/usr/local/nagios/var/retention.dat
> retention_update_interval=120
> use_retained_program_state=1
> use_retained_scheduling_info=1
> interval_length=2
> use_aggressive_host_checking=0
> execute_service_checks=1
> accept_passive_service_checks=0
> execute_host_checks=1
> accept_passive_host_checks=0
> enable_notifications=1
> enable_event_handlers=0
> process_performance_data=0
> obsess_over_services=0
> obsess_over_hosts=0
> check_for_orphaned_services=0
> check_service_freshness=0
> service_freshness_check_interval=300
> check_host_freshness=0
> host_freshness_check_interval=1500
> aggregate_status_updates=0
> status_update_interval=24
> enable_flap_detection=0
> low_service_flap_threshold=5.0
> high_service_flap_threshold=20.0
> low_host_flap_threshold=5.0
> high_host_flap_threshold=20.0
> date_format=strict-iso8601
> illegal_object_name_chars=`~!$%^&*|'"<>?,()=
> illegal_macro_output_chars=`~$&|'"<>
> use_regexp_matching=0
> use_true_regexp_matching=0
> admin_email=its-admin at its-lehmann.de
> admin_pager=<nicht vorhanden>
> daemon_dumps_core=0

> elf:~ # cat /usr/local/nagios/etc/mini.cfg
> define command{
>         command_name    check-host-alive
> #        command_line    sudo -u root $USER1$/check_icmp -H $HOSTADDRESS$ -w 300.0,30% -c 500.0,70% -p 10
>         command_line    $USER1$/check_dummy 0 Immer_ok_dafür_sorg_ich_schon
>         }
> 
> define command{
>         command_name    check_dhcp
>         command_line    sudo -u root $USER1$/check_dhcp --serverip=$ARG1$
> }
> 
> define command{
>         command_name    check_local_disk
>         command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
> }
> 
> 
> define command{
>         command_name    nix
>         command_line    /bin/true
> }
> 
> define host{
>         host_name               Elf
>         alias                   Elf
>         address                 192.168.0.4
>         check_command           check-host-alive
>         max_check_attempts      2
>         check_interval          12
>         check_period            24x7
>         contact_groups          admins
>         notification_interval   22
>         notification_period     24x7
>         notification_options    d,u,r,f
> }
> 
> define hostgroup{
>         hostgroup_name  alles
>         alias           Alles
>         members         Elf
> }
> 
> #Using both checks results in an increasing memory usage.
> 
> #If I use this service alone there's no increase in MemUsage
> #define service{
> #       host_name               Elf
> #       service_description     DHCP
> #       check_command           check_dhcp!192.168.0.4
> #       max_check_attempts      2
> #       normal_check_interval   1
> #       retry_check_interval    1
> #       check_period            24x7
> #       notification_interval   22
> #       notification_period     24x7
> #       notification_options    w,u,c,r,f
> #       contact_groups          admins
> #}
> 
> #This one alone is ok.
> define service{
>         host_name               Elf
>         service_description     DISK
>         check_command           check_local_disk!10%!5%!/
>         max_check_attempts      2
>         normal_check_interval   1
>         retry_check_interval    1
>         check_period            24x7
>         notification_interval   22
>         notification_period     24x7
>         notification_options    w,u,c,r,f
>         contact_groups          admins
> }
> 
> define service{
>         host_name               Elf
>         service_description     DISK2
>         check_command           check_local_disk!10%!5%!/tmp
>         max_check_attempts      2
>         normal_check_interval   1
>         retry_check_interval    1
>         check_period            24x7
>         notification_interval   22
>         notification_period     24x7
>         notification_options    w,u,c,r,f
>         contact_groups          admins
> }
> 
> 
> define contactgroup{
>         contactgroup_name       admins
>         alias                   Administrators
>         members                 admin
> }
> 
> 
> define contact{
>         contact_name                    admin
>         alias                           Admins
>         email                           admin at elf
>         host_notification_period        24x7
>         service_notification_period     24x7
>         host_notification_options       d,u,r,f,n
>         service_notification_options    w,u,c,r,f,n
>         service_notification_commands   nix
>         host_notification_commands      nix
> }
> 
> define timeperiod{
>         timeperiod_name 24x7
>         alias           Always
>         sunday          00:00-24:00
>         monday          00:00-24:00
>         tuesday         00:00-24:00
>         wednesday       00:00-24:00
>         thursday        00:00-24:00
>         friday          00:00-24:00
>         saturday        00:00-24:00
> }


-- 
IT-Service Lehmann                    al at its-lehmann.de
Arno Lehmann                  http://www.its-lehmann.de

