BUG/PATCH: Runaway processes under Linux (and others)

bruce nagios-devel at vicious.dropbear.id.au
Thu Apr 27 11:49:10 CEST 2006


On Thu, 27 Apr 2006, Andreas Ericsson wrote:

> bruce wrote:

> Anyways, this:
>
>
> + /* exit with a dirty feeling */
> + static void signal_exit( void ){
> + 	_exit(1);
> + 	}
> +
>
> is wrong. The prototype for signal handlers must be
>
> 	void signal_exit(int signum);
>
> The static keyword is of course optional and valid.
>
> Otherwise it looks like a good patch.

Ah.  I bow to your greater C-Fu ;).  Duly edited and applied to my working 
copy.

>> On some systems, a rarer problem shows itself, making the solution to the 
>> Nagios issue somewhat harder.  This problem is when a child process, 
>> inheriting the parent's signal handlers, receives a signal (usually 
>> SIGCHLD, sometimes SIGTERM) and then exits, taking out the parent's 
>> lock/pid file.  Thus, one no longer knows which process is the legitimate 
>> parent process.
>
> If nagios' grandchildren (the ones that popen() commands) receive SIGCHLD 
> from anything but the check they're running, something is very, very wrong 
> with the system you're using. Are you perhaps using the old and deprecated 
> NGPT-library?

The grandchildren are created in run_system_checks(), and I haven't caught 
child processes created from that segment of code removing the lock file, 
although this may be unwillingness on my part to fully match up the debug 
output ;).  (For the record, the thread library in use, according to 
'getconf GNU_LIBPTHREAD_VERSION', is 'NPTL 2.3.6'.)

The lock removal instead seems to be occurring with the child process 
created in my_system(), which sometimes stalls at a point before the 
signal handlers get reset (or they don't get reset at all; my debugging 
statements weren't fine-grained enough to tell).  If the parent sends a 
TERM signal to the child while it is in this state (due to timeout), the 
child runs the signal handlers inherited from the parent, removing the 
lock file.

>> With these patches on, the rate of stray process creation has dropped, but 
>> I am still seeing occasional orphaned processes around;

Overnight, I had one machine fail due to the death-by-nibbles problem, 
which, due to its location and sudden lack of boot sector, will be a 
two-banana fix.  As an interim fix, the remaining machines are now 
restarting Nagios every two hours from cron, although this smacks of 
inelegance.

>> i.e., I've fixed some 
>> of the symptoms, but not the actual cause.  That will take some more 
>> rewrites.
>
> Yup. The choice of a FIFO pipe for passing check-results back to the master 
> process was unfortunately a bad one which is now irrevocable without major 
> code-surgery.

Yes.  It has scaling issues which do not show themselves in small 
installations (say, under 100 service checks).

-- 
   Bruce Campbell





