nagios stops to check & orphans

Samuel Bancal sam.bancal at gmail.com
Mon Jun 8 12:51:47 CEST 2009

Previous message: re-execute an event handler just in case the service stays DOWN
Next message: Do services inherit downtime from their hosts?
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

I'm a new Nagios administrator (since February 2009).
Until now, everything was working smoothly.

This morning I saw that, during the weekend, the Nagios daemon had stopped
running checks.
After some research (on the server and on the web), here is what I've found.
Can someone explain what is going on, and how to avoid this problem in the
future?

OS : Ubuntu server 8.04.2 LTS
Versions : nagios-3.0.6 & nagios-plugins-1.4.13
Hardware : on VMware server infrastructure.

NTP is not set up yet (I don't know whether that has a side effect in my
case, since time may be involved in the problem).

We're currently monitoring 12 hosts and 64 services.

What I can see in the web interface (Scheduling Queue):
Host          Service        Last check             Next check
server_xxx                   2009-06-07 03:52:35    2009-06-07 09:19:45    Orphan    ENABLED
server_yyy    service_zzz    2009-06-07 03:50:31    2009-06-07 09:19:45    Orphan    ENABLED

All hosts and services except 2 are "orphan"...
Both "last check" and "next check" are from yesterday morning!

On the server:
$ ps auxft | grep nagios\.cfg | grep -v grep
nagios   20578  0.4 72.9 2969592 1505772 ?     Ssl  Apr30 275:20 /usr/local/nagios/bin/nagios -d /etc/nagios/nagios.cfg

-> Wow ... nagios uses 72.9% of the server's memory!

$ free
             total       used       free     shared    buffers     cached
Mem:       2062920    1636656     426264          0       4404      24532
-/+ buffers/cache:    1607720     455200
Swap:      1951888    1450744     501144
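
Since the daemon has been running since Apr30, one way to tell steady growth from a sudden jump would be to sample its resident memory at intervals. The following is only a minimal sketch under stated assumptions: it relies on the Linux /proc filesystem, and rss_kb is a hypothetical helper name, not part of Nagios.

```shell
#!/bin/sh
# Sketch: print the resident set size (RSS, in kB) of one process.
# Assumption: Linux, where /proc/PID/status contains a "VmRSS:" line.
# rss_kb is a hypothetical helper name, not a Nagios tool.

rss_kb() {
    # Field 2 of the VmRSS line is the size in kB.
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Possible use from cron (20578 is the nagios PID in the ps output above;
# the output file name is only an example):
#   echo "$(date '+%Y-%m-%d %H:%M:%S') $(rss_kb 20578)" >> /tmp/nagios-rss.log
```

A series of such samples would show whether the memory grows continuously over the weeks the daemon has been up.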

What about forks?
$ pstree -aclpn
init,1
#snip
  ├─nagios,20578 -d /etc/nagios/nagios.cfg
  │   └─{nagios},20579
#snap

What about the log?
In /var/nagios/archives/nagios-06-08-2009-00.log
...
thousands of:
[1244325825] Warning: The check of service 'Partition /' on host 'server_xxx' looks like it was orphaned (results never came back).  I'm scheduling an immediate check of the service...
and later, thousands of:
[1244355705] Warning: The check of service 'HTTP' on host 'server_xxx' could not be performed due to a fork() error: 'Cannot allocate memory'.  The check will be rescheduled.
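
To put numbers on "thousands of" and to read the bracketed timestamps, something like the following could be used. It is a rough sketch, assuming each log entry sits on a single line in the actual file and that GNU date is available; log_summary and epoch_to_utc are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: summarise a Nagios log. Helper names are hypothetical.

# Count the two warnings quoted above. index() does a plain substring
# match, so the parentheses in "fork()" need no escaping.
log_summary() {
    awk '
        index($0, "looks like it was orphaned") { orphaned++ }
        index($0, "due to a fork() error")      { fork_errors++ }
        END { printf "orphaned=%d fork_errors=%d\n", orphaned, fork_errors }
    ' "$1"
}

# Convert a log timestamp (seconds since the epoch) to UTC.
# Assumption: GNU date, which accepts "@SECONDS" with -d.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}
```

For instance, `epoch_to_utc 1244325825` gives 2009-06-06 22:03:45 (UTC), which is 00:03:45 CEST on 7 June, and `log_summary /var/nagios/archives/nagios-06-08-2009-00.log` would print one line with both counts.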

If I run strace on process 20578, it loops on:
nanosleep({0, 250000000}, NULL)         = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1892, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=1892, ...}) = 0

And strace on process 20579 loops on:
poll([{fd=5, events=POLLIN}], 1, 500)   = 0


Part of the config:
$ egrep 'status_update|reaper|orphan' /etc/nagios/nagios.cfg
status_update_interval=10
check_result_reaper_frequency=10
max_check_result_reaper_time=30
check_for_orphaned_services=1
check_for_orphaned_hosts=1


Thanks for any reply,

Best regards,
Samuel Bancal

-- 
Samuel Bancal - CH
