Nagios zombies take down system.

Chris Gill cgill at newworldapps.com
Thu Jan 15 16:50:40 CET 2004

Hello all,
	I've got a problem I've been wrestling with for a couple of weeks now. A new monitoring server we're deploying winds up freezing overnight. This seems to be the result of check modules going zombie and hanging around. The machine becomes exceedingly slow and effectively unresponsive. What really gets me is that this is a slow process: the machine usually runs, nice and snappy, for 12-18 hours first. I've tried tuning the service_reaper_frequency setting and set it down to 2, which seems to help some, but not enough to make the system stable. In addition to the zombied checks, I've also got a number of nagios processes left lying about, even after doing a nagios stop and killing all the bad checks. A sample from ps -xa:

 1770 tty1     S      0:00 /sbin/mingetty tty1
 1771 tty2     S      0:00 /sbin/mingetty tty2
 1772 tty3     S      0:00 /sbin/mingetty tty3
 1773 tty4     S      0:00 /sbin/mingetty tty4
 1774 tty5     S      0:00 /sbin/mingetty tty5
 1781 tty6     S      0:00 /sbin/mingetty tty6
 3635 ?        SW     0:02 [nagios]
17743 ?        S      0:01 /usr/bin/python /usr/local/sbin/nagios-statd -a 127.0
32158 ?        SW     0:00 [nagios]
32547 ?        SW     0:00 [nagios]
32588 ?        SW     0:00 [nagios]
  521 ?        SW     0:00 [nagios]
  524 ?        SW     0:00 [nagios]
  568 ?        SW     0:00 [nagios]
  571 ?        Z      0:00 [nagios <defunct>]
  621 ?        SW     0:00 [nagios]
10854 ?        S      0:00 [httpd]
11007 ?        S      0:00 [httpd]
21525 ?        SW     0:00 [httpd]
 2524 ?        SW     0:00 [httpd]
 2859 ?        SW     0:00 [httpd]
 5372 ?        R      0:00 /usr/sbin/sshd
 5728 pts/0    S      0:00 -bash
 8841 ?        SW     0:00 [httpd]
 9086 ?        SW     0:00 [httpd]
 9900 pts/0    R      0:00 ps -xa
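
For anyone wanting to watch how the zombie count grows over those 12-18 hours, a listing like the one above can be tallied mechanically. The following is only a rough sketch: it assumes the five-column `ps -xa` layout shown here (PID, TTY, STAT, TIME, COMMAND), and the function name `summarize_ps` and the `SAMPLE` text are made up for illustration, not part of Nagios or of the original setup.

```python
from collections import Counter


def summarize_ps(ps_text):
    """Tally process states and collect zombies from `ps -xa` style output.

    Assumes each process line has the columns PID TTY STAT TIME COMMAND.
    """
    states = Counter()
    zombies = []
    for line in ps_text.splitlines():
        fields = line.split(None, 4)  # keep the whole command as one field
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip header lines and blank lines
        pid, _tty, stat, _time, command = fields
        states[stat[0]] += 1  # first letter of STAT is the process state
        if stat.startswith("Z"):
            zombies.append((int(pid), command))
    return states, zombies


# A few lines copied from the listing above, used as sample input.
SAMPLE = """\
 3635 ?        SW     0:02 [nagios]
  568 ?        SW     0:00 [nagios]
  571 ?        Z      0:00 [nagios <defunct>]
  621 ?        SW     0:00 [nagios]
 9900 pts/0    R      0:00 ps -xa
"""

states, zombies = summarize_ps(SAMPLE)
print("states:", dict(states))
print("zombies:", zombies)
```

Run periodically against real ps output and logged with a timestamp, something along these lines would show whether the defunct processes pile up steadily or appear in bursts.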

Here's what uptime had to say when I finally got into the box this morning:
 09:50:52  up 23:59,  1 user,  load average: 24.28, 37.35, 66.70
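
To put those load figures in proportion, they can be divided by the number of processors. The snippet below is a minimal sketch, assuming the `load average: a, b, c` format printed by uptime above and the two CPUs of the machine described in this message; `load_per_cpu` is a hypothetical helper name.

```python
import re

# Matches the three load averages at the end of an uptime line.
UPTIME_RE = re.compile(r"load average: ([\d.]+), ([\d.]+), ([\d.]+)")


def load_per_cpu(uptime_line, cpus):
    """Return the 1-, 5- and 15-minute load averages divided by `cpus`."""
    match = UPTIME_RE.search(uptime_line)
    if match is None:
        raise ValueError("no load averages found in: %r" % uptime_line)
    return tuple(float(value) / cpus for value in match.groups())


line = " 09:50:52  up 23:59,  1 user,  load average: 24.28, 37.35, 66.70"
one, five, fifteen = load_per_cpu(line, cpus=2)  # cpus=2: dual-processor box
print("per-CPU load (1/5/15 min): %.2f %.2f %.2f" % (one, five, fifteen))
print("trend:", "falling" if one < fifteen else "rising or flat")
```

With the numbers above the 1-minute figure is lower than the 15-minute one, which would be consistent with the load having peaked earlier and already dropping by the time of the login.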

Getting all the bad nagios stuff killed off does get the system righted. It's monitoring 150 hosts with 219 checks, so this seems like a small install compared to what I've read about some people doing. The system's a dual P3 800 running Red Hat 9. I've used RH9 before on other monitoring systems without any problems, although without quite so many checks, and on less powerful machines. I can provide clips from my nagios.cfg if that'd help, but I don't want to load up this e-mail too much.

Any ideas would be appreciated, as they would keep me from ripping all my hair out. Thanks.



-----------------------------------------------------------------
Christopher P. Gill, Systems Engineer, New World Apps
cgill at newworldapps.com
703-856-7268 (Cell/Business)



_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null
