BUG/PATCH: Runaway processes under Linux (and others)

Andreas Ericsson ae at op5.se
Thu Apr 27 10:17:49 CEST 2006


bruce wrote:
> 
> This relates to a number of issues that people have seen with Nagios and 
> Nsca running under Linux, having many copies of these daemons running, 
> and eventually running out of memory, frequently crashing the machine.  
> This post attempts to summarise the problems for those searching the 
> archives.  If you have an OS/distribution/libraries that are susceptible 
> to this problem, here is a short summary:
> 
>         You're screwed.
> 
> The problem at heart is that Nagios, and Nsca, use function calls after 
> forking that are either susceptible to race conditions with other 
> children, have the possibility of blocking, or cancel pending alarm()s.
> 

Note also that the problems with function calls after fork() only occur 
in threaded applications, where they can end in deadlock. The race is 
actually for mutexes and semaphores that are shared between the forking 
thread and other threads. Some thread libraries jump through hoops to 
try to handle this; others do not. The standard doesn't require it but 
doesn't explicitly disallow or discourage it either.

> In practical terms, these two cases manifest themselves as a high number 
> of Nagios and/or Nsca processes, which are being created at a rate 
> slightly lower than the frequency of service checks being run/incoming 
> result submission.  Eventually, this will cause a crash, as very few 
> memory management schemes properly deal with the death-by-tiny-bites 
> situation.
> 

None, that I've seen.

> In the short term, the Nsca issue can be avoided by invoking 
> '/etc/init.d/nsca restart' from Cron every 5 minutes.  A dropped result 
> every 5 minutes is a comparatively small price to pay.

I think more than one result can be lost if more than one instance is 
spinning on the file-lock, but I'm not sure.

Anyways, this:


+ /* exit with a dirty feeling */
+ static void signal_exit( void ){
+ 	_exit(1);
+ 	}
+

is wrong. The prototype for signal handlers must be

	void signal_exit(int signum);

The static keyword is of course optional and valid.

Otherwise it looks like a good patch.

> 
> On some systems, a rarer problem shows itself, making the solution to 
> the Nagios issue somewhat harder.  This problem is when a child process, 
> inheriting the parent's signal handlers, receives a signal (usually 
> SIGCHLD, sometimes SIGTERM) and then exits, taking out the parent's 
> lock/pid file.  Thus, one no longer knows which process is the 
> legitimate parent process.
> 

If nagios' grandchildren (the ones that popen() commands) receive 
SIGCHLD from anything but the check they're running, something is very, 
very wrong with the system you're using. Are you perhaps using the old 
and deprecated NGPT-library?

> Tracking down this rare problem (which happens all too often to suit me) 
> led me to creating the attached Nagios patch, which turns off 
> daemon_mode right away after forking (so the lock file doesn't get 
> deleted if a stray signal comes in), resets the signal handlers a bit 
> earlier in the children (so the parent's signal handlers aren't 
> triggered) and reinstates the alarm before talking to the parent (rather 
> than no timeout).  Overall, I'd much rather miss test results (and 
> have Nagios try the service check again) than have my machines 
> nibbled to death.
> 
> With these patches on, the rate of stray process creation has dropped, 
> but I am still seeing occasional orphaned processes around; i.e., I've 
> fixed some of the symptoms, but not the actual cause.  That will take 
> some more rewrites.
> 

Yup. The choice of a FIFO pipe for passing check-results back to the 
master process was unfortunately a bad one which is now irrevocable 
without major code-surgery.
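
To illustrate the child-side ordering the patch description talks about 
(reset the inherited handlers first, then arm an alarm before writing to 
the parent), here is a rough sketch. It is not the patch itself and not 
Nagios code: report_from_child(), its parameters and the choice of 
signals to reset are assumptions made for the example.

```c
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that writes result to write_fd with a
 * timeout. Returns the child's exit status (0 if the whole result was
 * written), or -1 if fork/wait failed or the child was killed by a
 * signal such as SIGALRM. */
static int report_from_child(int write_fd, const char *result,
			     unsigned int timeout_secs)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		size_t len = strlen(result);
		ssize_t n;

		/* Drop the inherited handlers before doing anything else, so
		 * a stray signal cannot run the parent's cleanup code. */
		signal(SIGCHLD, SIG_DFL);
		signal(SIGTERM, SIG_DFL);
		signal(SIGALRM, SIG_DFL);

		/* With the default SIGALRM action the child simply dies if
		 * the write blocks for longer than timeout_secs. */
		alarm(timeout_secs);
		n = write(write_fd, result, len);
		alarm(0);
		_exit(n == (ssize_t)len ? 0 : 1);
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A child that times out is lost along with its result, which is the 
trade-off bruce describes: a missed result rather than a stuck process.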

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231





