Nagios 'Out Of Memory' Problems

Armistead, Raffy rarmistead at datanamicsinc.com
Thu Mar 23 19:23:13 CET 2006


I have a problem with my Nagios server constantly crashing. It keeps
printing Out of Memory errors on the console, and I then lose access to
the server: I can still ping the box, but I cannot SSH or browse to it
to view any information. This has been happening more and more often
and now occurs about every 2-3 days. We have been steadily adding
devices to the servers, and the problem has grown worse as we do. This
is how I have it set up.

I have a main Nagios server running the latest 2.0 (stable) Nagios
release. It monitors about 6800 devices but does not actively check
them. Its main role is to provide a web interface and receive passive
check results from three other servers, which do the polling. The main
server also sends email notifications when a device goes down, about
30-40 emails a day. I am using NSCA 2.5 between the main server and the
client Nagios servers. I monitor only one service for each device,
either TCP or ping depending on the device. Most devices (roughly 6000)
are monitored with TCP; the rest are monitored with ping. The devices
are spread fairly evenly across the individual servers, about 2000-2500
each.

All the servers are basic desktop computers, Dell Dimension 2400s with
base hardware. The main server was upgraded to 2GB of RAM, while the
other servers have 512MB each. They all have 2.4 GHz Celeron
processors. The individual servers are not having out-of-memory
problems, and they are running the latest 2.0 (stable) release as well.
All of them run Red Hat 9.0 with every package installed.

Can someone please help me resolve this problem? Thanks.

The output of top does not look as though the server is running out of
memory. This is typical output after the server has been running for a
few hours.

57 processes: 54 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  41.1% user  58.8% system   0.0% nice   0.0% iowait   0.0% idle
Mem:  2063556k av,  285940k used, 1777616k free,       0k shrd,   41056k buff
                    177644k actv,   51688k in_d,   10892k in_c
Swap: 1044184k av,       0k used, 1044184k free                  114208k cached
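As a quick sanity check on the snapshot above, the figures can be pulled
out of the Mem, Swap and CPU lines and turned into percentages. This is
only a sketch: the regular expressions assume the top layout shown here
(with wrapped lines rejoined), and the function name summarize_top is
made up for the example.

```python
import re

# Header lines copied from the top snapshot above (wrapped lines rejoined).
TOP_LINES = [
    "CPU states:  41.1% user  58.8% system   0.0% nice   0.0% iowait   0.0% idle",
    "Mem:  2063556k av,  285940k used, 1777616k free,       0k shrd,   41056k buff",
    "Swap: 1044184k av,       0k used, 1044184k free                  114208k cached",
]


def summarize_top(lines):
    """Return memory-used %, swap-used % and CPU idle % from top header lines."""
    summary = {}
    for line in lines:
        if line.startswith(("Mem:", "Swap:")):
            label = line.split(":", 1)[0].lower()          # "mem" or "swap"
            total = int(re.search(r"(\d+)k av", line).group(1))
            used = int(re.search(r"(\d+)k used", line).group(1))
            summary[label + "_used_pct"] = round(100.0 * used / total, 1)
        elif line.startswith("CPU states:"):
            idle = re.search(r"([\d.]+)% idle", line).group(1)
            summary["cpu_idle_pct"] = float(idle)
    return summary


print(summarize_top(TOP_LINES))
```

For this snapshot it reports about 13.9% of RAM in use, no swap in use,
and 0% idle CPU. So at the moment the snapshot was taken memory was not
under pressure, but the CPU was fully busy, with more than half of the
time spent in system mode.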

 

 

 

Here is a sample of the configuration I use for devices on the main
server:

hosts.cfg

define host {
    name                           generic-host     ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
    notifications_enabled          1        ; Host notifications are enabled
    event_handler_enabled          0        ; Host event handler is disabled
    flap_detection_enabled         1        ; Flap detection is enabled
    process_perf_data              1        ; Process performance data
    retain_status_information      1        ; Retain status information across program restarts
    retain_nonstatus_information   1        ; Retain non-status information across program restarts
    max_check_attempts             10
    notification_interval          720
    notification_period            24x7
    obsess_over_host               0
    notification_options           d,u,r,f
    register                       0        ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}

define host {
    use                            generic-host          ; Name of host template to use
    host_name                      DETAH-R1
    alias                          DETAH-R1
    address                        x.x.x.x
    check_command                  check_ping!200,40%!10000,100%
    contact_groups                 device-admins,DETAH-admins,router-admins
}

services.cfg

define service {
    name                           generic-service  ; The 'name' of this service template, referenced in other service definitions
    active_checks_enabled          0        ; Active service checks are disabled
    passive_checks_enabled         1        ; Passive service checks are enabled/accepted
    parallelize_check              1        ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service            0        ; We should not obsess over this service
    check_freshness                1        ; Check service 'freshness' (the default is NOT to check)
    freshness_threshold            1800
    notifications_enabled          1        ; Service notifications are enabled
    event_handler_enabled          0        ; Service event handler is disabled
    flap_detection_enabled         1        ; Flap detection is enabled
    process_perf_data              1        ; Process performance data
    retain_status_information      1        ; Retain status information across program restarts
    retain_nonstatus_information   1        ; Retain non-status information across program restarts
    is_volatile                    0
    check_period                   24x7
    max_check_attempts             6
    normal_check_interval          20
    retry_check_interval           5
    notification_interval          720
    notification_period            24x7
    notification_options           n
    register                       0        ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

define service {
    use                            generic-service          ; Name of service template to use
    host_name                      DETAH-R1
    service_description            PING
    contact_groups                 device-admins,DETAH-admins,router-admins
    check_command                  check_ping!200,40%!1000,100%
}
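To put a number on the load these templates imply for the main server,
here is a rough back-of-the-envelope calculation. It assumes one service
per device (about 6800 in total), that each service reports once per
normal_check_interval, and that interval_length in nagios.cfg is left at
its default of 60 seconds. The last two points are assumptions, not
something the configuration shown confirms, so treat the result as an
estimate.

```python
# Figures taken from the post; INTERVAL_LENGTH_S is an assumption
# (the Nagios default for interval_length in nagios.cfg).
SERVICES = 6800               # about one service per monitored device
NORMAL_CHECK_INTERVAL = 20    # from the generic-service template
FRESHNESS_THRESHOLD_S = 1800  # from the generic-service template
INTERVAL_LENGTH_S = 60        # assumed default, in seconds


def passive_results_per_second(services, check_interval, interval_length):
    """Average rate of passive results if each service reports once per interval."""
    return services / (check_interval * interval_length)


def freshness_margin_s(threshold_s, check_interval, interval_length):
    """Seconds a result may arrive late before it is considered stale."""
    return threshold_s - check_interval * interval_length


rate = passive_results_per_second(SERVICES, NORMAL_CHECK_INTERVAL, INTERVAL_LENGTH_S)
margin = freshness_margin_s(FRESHNESS_THRESHOLD_S, NORMAL_CHECK_INTERVAL, INTERVAL_LENGTH_S)
print(f"~{rate:.1f} passive results per second on average")
print(f"{margin} s of slack before a result counts as stale")
```

Under these assumptions the main server receives roughly 5.7 passive
results per second on average, and a result that arrives more than 600
seconds late would trip the freshness check. When that happens Nagios is
documented to run the service's check_command itself, even with active
checks disabled, so a backlog of late results could add work on the main
server.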

 

Here is a sample configuration from one of the individual polling
servers:

hosts.cfg

define host {
    name                           generic-host     ; The name of this host template - referenced in other host definitions, used for template recursion/resolution
    notifications_enabled          1        ; Host notifications are enabled
    event_handler_enabled          0        ; Host event handler is disabled
    flap_detection_enabled         1        ; Flap detection is enabled
    process_perf_data              1        ; Process performance data
    retain_status_information      1        ; Retain status information across program restarts
    retain_nonstatus_information   1        ; Retain non-status information across program restarts
    max_check_attempts             10
    notification_interval          720
    notification_period            24x7
    obsess_over_host               0
    notification_options           d,u,r,f
    register                       0        ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL HOST, JUST A TEMPLATE!
}

define host {
    use                            generic-host          ; Name of host template to use
    host_name                      DETAH-R1
    alias                          DETAH-R1
    address                        x.x.x.x
    check_command                  check_ping!200,40%!10000,100%
    contact_groups                 device-admins,DETAH-admins,router-admins
}

services.cfg

define service {
    name                           generic-service  ; The 'name' of this service template, referenced in other service definitions
    active_checks_enabled          1        ; Active service checks are enabled
    passive_checks_enabled         1        ; Passive service checks are enabled/accepted
    parallelize_check              1        ; Active service checks should be parallelized (disabling this can lead to major performance problems)
    obsess_over_service            1        ; We should obsess over this service (if necessary)
    check_freshness                1        ; Check service 'freshness' (the default is NOT to check)
    freshness_threshold            1800
    notifications_enabled          1        ; Service notifications are enabled
    event_handler_enabled          0        ; Service event handler is disabled
    flap_detection_enabled         1        ; Flap detection is enabled
    process_perf_data              1        ; Process performance data
    retain_status_information      1        ; Retain status information across program restarts
    retain_nonstatus_information   1        ; Retain non-status information across program restarts
    is_volatile                    0
    check_period                   24x7
    max_check_attempts             6
    normal_check_interval          20
    retry_check_interval           5
    notification_interval          720
    notification_period            24x7
    notification_options           n
    register                       0        ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
}

define service {
    use                            generic-service          ; Name of service template to use
    host_name                      DETAH-R1
    service_description            PING
    contact_groups                 device-admins,DETAH-admins,router-admins
    check_command                  check_ping!200,40%!1000,100%
}
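Since the two generic-service templates are long and nearly identical,
it can help to compare them mechanically. The sketch below parses
"directive value ; comment" lines into dictionaries and prints only the
directives whose values differ. The helper names are made up for the
example, and only a handful of the directives from the templates above
are reproduced to keep it short.

```python
def parse_directives(block):
    """Parse 'directive value ; comment' lines into a dict."""
    directives = {}
    for line in block.splitlines():
        line = line.split(";", 1)[0].strip()      # drop trailing comments
        if not line or line.startswith(("define", "}")):
            continue                              # skip block delimiters
        parts = line.split(None, 1)
        directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return directives


def diff_directives(left, right):
    """Return {directive: (left_value, right_value)} where the values differ."""
    keys = sorted(set(left) | set(right))
    return {k: (left.get(k), right.get(k))
            for k in keys if left.get(k) != right.get(k)}


# Abbreviated copies of the two generic-service templates.
MAIN = """
define service {
    active_checks_enabled   0  ; main server
    passive_checks_enabled  1
    obsess_over_service     0
    check_freshness         1
    freshness_threshold     1800
}
"""

POLLER = """
define service {
    active_checks_enabled   1  ; individual polling server
    passive_checks_enabled  1
    obsess_over_service     1
    check_freshness         1
    freshness_threshold     1800
}
"""

differences = diff_directives(parse_directives(MAIN), parse_directives(POLLER))
for name, (main_value, poller_value) in differences.items():
    print(f"{name}: main={main_value} poller={poller_value}")
```

For the templates shown in this post the differences are
active_checks_enabled and obsess_over_service, both 0 on the main server
and 1 on the individual servers. The host definitions are the same on
both sides.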

 

Raffy

 
