BUG/PATCH: Runaway processes under Linux (and others)

bruce nagios-devel at vicious.dropbear.id.au
Wed Apr 26 16:46:43 CEST 2006
This relates to a number of issues that people have seen with Nagios and 
Nsca running under Linux: many copies of these daemons pile up, the 
machine eventually runs out of memory, and it frequently crashes.  This 
post attempts to summarise the problems for those searching the 
archives.  If you have an OS/distribution/libraries that are susceptible 
to this problem, here is a short summary:

 		You're screwed.

The problem at heart is that Nagios, and Nsca, use function calls after 
forking that are either susceptible to race conditions with other 
children, have the possibility of blocking, or cancel pending alarm()s.

Depending on your OS/distribution/libraries, usage of such functions 
within a fork()ed child may well mean that the alarm timeouts set simply 
do not arrive.  The child process will sit in an unknown state for a very 
long time.
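
To make the alarm() point concrete, here is a minimal sketch of the 
standard behaviour that matters here.  It is not taken from Nagios or 
Nsca, and both helper names are made up for illustration.  A process has 
only one pending alarm: a second alarm() call replaces it, and alarm(0) 
cancels it outright, leaving no timeout at all.

```c
#include <unistd.h>

/* Hypothetical helper: arm a 30s alarm, then re-arm with 60s.
 * alarm() returns the seconds that were left on the alarm it replaced. */
static unsigned replaced_alarm_remaining(void)
{
    unsigned left;

    alarm(30);
    left = alarm(60);   /* replaces the 30s alarm, does not add to it */
    alarm(0);           /* tidy up: cancel the 60s alarm */
    return left;
}

/* Hypothetical helper: after alarm(0) nothing is pending, so a further
 * alarm(0) reports 0 seconds remaining. */
static unsigned pending_after_cancel(void)
{
    alarm(30);
    alarm(0);           /* cancels the pending alarm */
    return alarm(0);
}
```

Any code path in a child that calls alarm(0) before it has finished 
talking to its parent therefore runs the rest of the way unprotected.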

In the case of Nagios, this has a high chance of occurring after it has 
fork()ed twice in base/checks.c->run_service_checks().  The main Nagios 
process does not know the PID of the grandchild, and has no checks in 
place to kill it after a timeout has elapsed.  Thus, if the (grand)child 
process just sits around, it will not be cleaned up by Nagios.
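
The double-fork situation can be sketched in isolation as follows.  This 
is an illustration only, not the Nagios code: the function name is 
hypothetical, and the pipe exists purely so that the demo can observe and 
then kill the grandchild.  fork() hands the parent only the PID of the 
intermediate child, so once that child has exited the parent has nothing 
left to wait on or to signal.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if the grandchild is still running after
 * the intermediate child has been reaped, 0 if not, -1 on error. */
static int grandchild_outlives_child(void)
{
    int fd[2];
    int status;
    int alive;
    pid_t child;
    pid_t grandchild = -1;

    if (pipe(fd) < 0)
        return -1;

    child = fork();
    if (child < 0) {
        close(fd[0]);
        close(fd[1]);
        return -1;
    }

    if (child == 0) {
        pid_t gc;

        close(fd[0]);
        gc = fork();
        if (gc < 0)
            _exit(1);
        if (gc == 0) {
            /* grandchild: stand-in for a check that never returns */
            close(fd[1]);
            for (;;)
                pause();
        }
        /* intermediate child: report the PID (demo only), then leave */
        if (write(fd[1], &gc, sizeof gc) != (ssize_t)sizeof gc)
            _exit(1);
        _exit(0);
    }

    /* parent: fork() only ever told us about `child` */
    close(fd[1]);
    if (read(fd[0], &grandchild, sizeof grandchild)
            != (ssize_t)sizeof grandchild)
        grandchild = -1;
    close(fd[0]);

    waitpid(child, &status, 0);     /* the child is gone and reaped */
    if (grandchild <= 0)
        return -1;

    alive = (kill(grandchild, 0) == 0);   /* signal 0: existence check */
    kill(grandchild, SIGKILL);            /* demo cleanup */
    return alive;
}
```

Without the demo pipe, the parent would have no PID to pass to kill(), 
which is why a hung grandchild needs its own timeout.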

In Nsca, there is no timeout set by default, and no reaping of child 
processes.  Thus, the child process can happily sit in an unknown state 
for as long as the parent daemon exists.  This happens more often when 
Nsca is running but Nagios is not, as the contention for the opening 
of the dump file, rather than the command pipe, more often results in 
blocking.
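
For reference, non-blocking reaping of children usually looks something 
like the sketch below.  This is not what Nsca does today, and the function 
names are hypothetical.  Note that reaping only collects children that 
have already exited; a child that is stuck in a blocking call still needs 
a timeout of its own.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Collect every child that has already exited, without blocking.
 * Returns the number of children reaped by this call. */
static int reap_finished_children(void)
{
    int reaped = 0;
    int status;

    /* WNOHANG: return 0 at once if children exist but none has exited,
     * -1 if there are no children at all */
    while (waitpid(-1, &status, WNOHANG) > 0)
        reaped++;
    return reaped;
}

/* Hypothetical demo: fork up to n short-lived children, then poll (for
 * roughly 5 seconds at most) until they have all been reaped. */
static int spawn_and_reap(int n)
{
    struct timespec pause_10ms = {0, 10 * 1000 * 1000};
    int forked = 0;
    int total = 0;
    int tries;

    for (; forked < n; forked++) {
        pid_t pid = fork();

        if (pid < 0)
            break;
        if (pid == 0)
            _exit(0);           /* child: nothing to do */
    }

    for (tries = 0; total < forked && tries < 500; tries++) {
        total += reap_finished_children();
        if (total < forked)
            nanosleep(&pause_10ms, NULL);
    }
    return total;
}
```

A daemon would typically call something like reap_finished_children() 
from its accept loop, or after being woken by SIGCHLD.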

In practical terms, these two cases manifest themselves as a high number 
of Nagios and/or Nsca processes, which are being created at a rate 
slightly lower than the frequency of service checks being run/incoming 
result submission.  Eventually, this will cause a crash, as very few 
memory management schemes properly deal with the death-by-tiny-bites 
situation.

Since my normal solution of installing a, shall we say, more 
POSIX-compliant OS on the monitoring systems isn't valid in this 
particular Fedora-loving Linux camp, some other solutions need to be 
found.

In the short term, the Nsca issue can be avoided by invoking 
'/etc/init.d/nsca restart' from cron every 5 minutes.  A dropped result 
every 5 minutes is a comparatively small price to pay.  The nsca patch 
attached sets up a timeout just after the fork for a new connection, which 
solves some of the issues.
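
The idea of arming a timeout straight after the fork can be tried in 
isolation with the sketch below.  It is a standalone illustration, not the 
patch itself: the names, the exit codes and the use of sigaction() and 
nanosleep() are my choices here, and the sleep stands in for the real work 
of handling a connection.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* SIGALRM handler: only async-signal-safe calls belong here, so leave
 * with _exit() rather than exit(). */
static void timeout_exit(int sig)
{
    (void)sig;
    _exit(2);
}

/* Fork a child that "works" for work_secs but gives up after
 * timeout_secs.  Returns the child's exit code (0 = finished,
 * 2 = timed out), or -1 on error. */
static int run_child_with_timeout(unsigned timeout_secs, unsigned work_secs)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct sigaction sa;
        struct timespec ts;

        /* first thing in the child: install the handler, arm the alarm */
        sa.sa_handler = timeout_exit;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        if (sigaction(SIGALRM, &sa, NULL) < 0)
            _exit(3);
        alarm(timeout_secs);

        /* stand-in for work that might block */
        ts.tv_sec = (time_t)work_secs;
        ts.tv_nsec = 0;
        while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
            ;
        _exit(0);
    }

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With a 1 second timeout and 5 seconds of "work" the child is cut short by 
the handler; with no work at all it finishes normally.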

On some systems, a rarer problem shows itself, making the solution to the 
Nagios issue somewhat harder.  This problem is when a child process, 
inheriting the parent's signal handlers, receives a signal (usually 
SIGCHLD, sometimes SIGTERM) and then exits, taking out the parent's 
lock/pid file.  Thus, one no longer knows which process is the legitimate 
parent process.

Tracking down this rare problem (which happens all too often to suit me) 
led me to creating the attached Nagios patch, which turns off daemon_mode 
right away after forking (so the lock file doesn't get deleted if a stray 
signal comes in), resets the signal handlers a bit earlier in the children 
(so the parent's signal handlers aren't triggered) and reinstates the 
alarm before talking to the parent (rather than no timeout).  Overall, I'd 
much rather miss test results (and have Nagios try the service check 
again) than have my machines nibbled to death.
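
The lock file problem, and the effect of resetting handlers early in the 
child, can be reproduced with a small standalone sketch.  Everything in it 
is hypothetical (the temporary lock file, the handler and the function 
names); it only mimics a parent whose SIGTERM handler removes its lock 
file, and does not use Nagios' daemon_mode or reset_sighandler().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char lock_path[64];      /* hypothetical lock file for the demo */

/* Parent's cleanup handler: remove the lock file and exit.  unlink()
 * and _exit() are both async-signal-safe. */
static void cleanup_and_exit(int sig)
{
    (void)sig;
    unlink(lock_path);
    _exit(0);
}

/* Returns 1 if the lock file survives a SIGTERM delivered to a forked
 * child, 0 if the child deleted it, -1 on error. */
static int lock_survives_child_signal(int reset_in_child)
{
    struct sigaction sa;
    int fd, status, survived;
    pid_t pid;

    strcpy(lock_path, "/tmp/lockdemo.XXXXXX");
    fd = mkstemp(lock_path);
    if (fd < 0)
        return -1;
    close(fd);

    sa.sa_handler = cleanup_and_exit;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGTERM, &sa, NULL);

    pid = fork();
    if (pid < 0) {
        unlink(lock_path);
        return -1;
    }
    if (pid == 0) {
        /* the child inherits the handler unless it is reset here */
        if (reset_in_child)
            signal(SIGTERM, SIG_DFL);
        raise(SIGTERM);         /* the stray signal */
        _exit(0);
    }

    waitpid(pid, &status, 0);
    survived = (access(lock_path, F_OK) == 0);

    unlink(lock_path);              /* demo cleanup */
    signal(SIGTERM, SIG_DFL);       /* restore the default in the parent */
    return survived;
}
```

Called with reset_in_child set to 0, the child runs the parent's handler 
and takes the lock file with it; with 1, the child simply dies and the 
file stays put.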

With these patches on, the rate of stray process creation has dropped, but 
I am still seeing occasional orphaned processes around; i.e., I've fixed 
some of the symptoms, but not the actual cause.  That will take some more 
rewrites.

--==--
Bruce.
-------------- next part --------------
*** src/nsca.c	2006/04/26 12:56:18
--- src/nsca.c	2006/04/26 13:00:50
***************
*** 254,259 ****
--- 254,264 ----
  	exit(return_code);
          }
  
+ /* exit with a dirty feeling */
+ static void signal_exit( int sig ){
+ 	_exit(1);
+ 	}
+ 
  
  
  /* read in the configuration file */
***************
*** 750,755 ****
--- 755,764 ----
                          return;
                          }
  		else{
+ 			/* Set up a timeout for our doom */
+ 			signal(SIGALRM,signal_exit);
+ 			alarm( 120 );
+ 
                          /* child does not need to listen for connections */
                          close(sock);
                          }
-------------- next part --------------
*** base/checks.c	2006/04/26 12:47:04
--- base/checks.c	2006/04/26 13:40:15
***************
*** 68,73 ****
--- 68,75 ----
  extern int      check_service_freshness;
  extern int      check_host_freshness;
  
+ extern int      daemon_mode;
+ 
  extern time_t   program_start;
  
  extern timed_event       *event_list_low;
***************
*** 378,383 ****
--- 380,392 ----
  	/* if we are in the child process... */
  	else if(pid==0){
  
+ 		/* Turn off daemon_mode right away so the lock file is not
+ 		 * deleted. */
+ 		daemon_mode=FALSE;
+ 
+                 /* reset signal handling */
+                 reset_sighandler();
+ 
  		/* set environment variables */
  		set_all_macro_environment_vars(TRUE);
  
***************
*** 448,454 ****
  #endif
  
  				/* reset the alarm */
! 				alarm(0);
  
  				/* get the check finish time */
  				gettimeofday(&end_time,NULL);
--- 457,463 ----
  #endif
  
  				/* reset the alarm */
! 				alarm(service_check_timeout);
  
  				/* get the check finish time */
  				gettimeofday(&end_time,NULL);
***************
*** 497,503 ****
  			pclose_result=pclose(fp);
  
  			/* reset the alarm */
! 			alarm(0);
  
  			/* get the check finish time */
  			gettimeofday(&end_time,NULL);
--- 506,512 ----
  			pclose_result=pclose(fp);
  
  			/* reset the alarm */
! 			alarm(service_check_timeout);
  
  			/* get the check finish time */
  			gettimeofday(&end_time,NULL);
*** base/utils.c	2006/04/26 12:48:26
--- base/utils.c	2006/04/26 13:16:31
***************
*** 2721,2726 ****
--- 2721,2732 ----
  	/* execute the command in the child process */
          if (pid==0){
  
+ 		/* Turn off daemon_mode right away */
+ 		daemon_mode=FALSE;
+ 
+ 		/* reset signal handling */
+ 		reset_sighandler();
+ 
  		/* become process group leader */
  		setpgid(0,0);
  
***************
*** 2732,2740 ****
  		free_memory();
  #endif
  
- 		/* reset signal handling */
- 		reset_sighandler();
- 
  		/* close pipe for reading */
  		close(fd[0]);
  
--- 2738,2743 ----
***************
*** 2788,2796 ****
  			/* close pipe for writing */
  			close(fd[1]);
  
- 			/* reset the alarm */
- 			alarm(0);
- 
  			_exit(status);
  		        }
  
--- 2791,2796 ----
***************
*** 2842,2850 ****
  		/* close pipe for writing */
  		close(fd[1]);
  
- 		/* reset the alarm */
- 		alarm(0);
- 		
  		/* clear environment variables */
  		set_all_macro_environment_vars(FALSE);
  
--- 2842,2847 ----