fork errors

Fred f1216 at yahoo.com
Fri Sep 2 22:16:29 CEST 2005


Just for fun, you might try reproducing the problem to see how many
forks you *can* get. For example:
 
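#!/usr/bin/perl
use strict;
use warnings;

$| = 1;    # unbuffered output so the counter updates in place

my $c = 0;
while (1) {
    my $pid = fork();
    # fork() returns undef on failure; $! holds the reason
    die "\nfork failed after $c children: $!\n" unless defined $pid;
    if ($pid) {
        # parent: count the child and keep going (children are never reaped)
        $c++;
        print "\rchildcount $c         ";
    }
    else {
        # child: linger briefly, then exit and become defunct
        sleep(1);
        exit(0);
    }
}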

This creates as many processes as it can, which tests your limit. You
would want to run it under the same environment that the nagios
process runs in.
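
If the test does hit a limit, the errno that fork() leaves in $! helps
tell the possible causes apart. Below is a minimal sketch, assuming Linux
semantics (EAGAIN for a process limit, ENOMEM for lack of memory). The
sub names explain_fork_errno and try_fork are made up for illustration
and are not part of nagios.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Errno qw(EAGAIN ENOMEM);
use POSIX ();

# Map a fork() errno value to its most likely cause (assumes Linux semantics).
sub explain_fork_errno {
    my ($errno) = @_;
    return 'process limit reached (per-user or system-wide)' if $errno == EAGAIN;
    return 'not enough memory for a new process'             if $errno == ENOMEM;
    return 'unexpected error';
}

# Try a single fork and return a status string describing the outcome.
sub try_fork {
    my $pid = fork();
    if (!defined $pid) {
        my $errno = $! + 0;    # numeric errno, saved before anything resets it
        return "fork failed: $! (" . explain_fork_errno($errno) . ")";
    }
    POSIX::_exit(0) if $pid == 0;    # child: leave immediately
    waitpid($pid, 0);                # parent: reap the child
    return 'fork ok';
}

print try_fork(), "\n";
```

Run as the nagios user while the problem is occurring, this should show
whether a plain fork() fails for that user and, if so, which of the two
causes is the more likely one.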
 
The children will all stay defunct, because the parent never waits for
them, until the parent exits (which happens when you hit the maximum
number of processes you can create).
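
For comparison, here is a sketch of what reaping looks like, i.e. the
step the test script deliberately leaves out. It assumes nothing beyond
core Perl and the POSIX module. The names reap_children and
spawn_and_reap are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(WNOHANG);

# Reap every child that has already exited, without blocking.
# Returns the number of children reaped in this pass.
sub reap_children {
    my $reaped = 0;
    $reaped++ while waitpid(-1, WNOHANG) > 0;
    return $reaped;
}

# Fork $n short-lived children, then poll until all of them are reaped.
sub spawn_and_reap {
    my ($n) = @_;
    my $started = 0;
    for (1 .. $n) {
        my $pid = fork();
        last unless defined $pid;        # stop quietly if fork() fails
        POSIX::_exit(0) if $pid == 0;    # child: exit at once
        $started++;
    }
    my $reaped = 0;
    for (1 .. 100) {                     # poll for up to about 5 seconds
        $reaped += reap_children();
        last if $reaped >= $started;
        select(undef, undef, undef, 0.05);
    }
    return ($started, $reaped);
}

my ($started, $reaped) = spawn_and_reap(5);
print "started $started, reaped $reaped\n";
```

A parent that does this regularly keeps no defunct entries around, so
its exited children stop counting against the process limit.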
 
The other thing you might try is to start nagios under
strace -f and send the output to a log. You can restrict strace
to process-management calls (fork, exec, wait, exit), i.e.,
strace -f -e trace=process nagios .... >/tmp/log 2>&1
 
That would give you a good handle on what is going on when the failure
occurs.  Might slow nagios down a bit, but probably nothing significant.
 
-FredC
 
 
 


Terry <td3201 at gmail.com> wrote:
I have a program that checks the logs by the minute and pages when the
fork errors occur, so we are responding within minutes. I have looked
at the resources every time it happens and we have plenty of
resources. Is there a single plugin I can put into debugging mode so
that when this happens I get more information as to why it is giving
these errors? Here are a few facts:
- the system is fine with memory all the time, never runs out (resident/paging)
- there is not an unusual number of processes running, maybe around
200 at a time, but nowhere near the ulimit setting
- ulimit for the 'nagios' user matches that of root (unlimited). Here
is the ulimit output:
core file size        (blocks, -c)     0
data seg size         (kbytes, -d)     unlimited
file size             (blocks, -f)     unlimited
max locked memory     (kbytes, -l)     4
max memory size       (kbytes, -m)     unlimited
open files            (-n)             1024
pipe size             (512 bytes, -p)  8
stack size            (kbytes, -s)     10240
cpu time              (seconds, -t)    unlimited
max user processes    (-u)             7168
virtual memory        (kbytes, -v)     unlimited

Thanks,
Terry



On 9/1/05, Fred wrote:
> My guess would be to look at your resource utilization on your system,
> most likely causes for fork() to fail are no more process slots, out of
> memory, or past some kind of per-user (non-root) limit. When this
> occurs look at your system logs, ps output and see if you have *lots*
> of processes hanging around. It could be that nagios has stopped reaping
> its children (or another unrelated process has sucked up the resources)
> and you have simply pushed your system to the edge. It might be that you
> get to that situation and it backs off before you even notice it and you
> are left with nagios having problems dealing with the aftermath.
> 
> -FredC
> 
> --- Terry wrote:
> 
> > Hello,
> >
> > I have been having this issue for quite some time. For some unknown
> > reason, nagios stops performing checks with these errors:
> >
> > [1125536952] Warning: The check of service 'PING' on host 'hostname'
> > could not be performed due to a fork() error. The check will be
> > rescheduled.
> >
> > All checks fail like this until nagios is restarted. When this
> > problem is occurring I can run the service checks manually both as the
> > nagios user and as the root user. There are no resource problems that
> > I can see at the time. We do not appear to be hitting a limit with
> > open files or anything like that either. The nagios user mirrors the
> > root user in that area.
> >
> > What could be wrong?
> >
> > Thanks!
> >
> >




