fork errors

Terry td3201 at gmail.com
Tue Sep 6 16:46:38 CEST 2005


I haven't tried the fork script yet, but I did try starting nagios under strace
as described. The main process that strace is tracing appears to exit
after spawning children. Here is the last snippet so you know what I mean:

[pid 29599] clone(Process 29644 attached
<unfinished ...>
[pid 29643] <... clone resumed> child_stack=0, 
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
child_tidptr=0xb75e50c8) = 29644
[tcb table full]
[pid 29599] <... clone resumed> child_stack=0, 
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, 
child_tidptr=0xb75e50c8) = 29645
[tcb table full]
Process 29641 detached
Process 29599 detached
Process 29632 detached
Process 29640 detached
Process 29635 detached
Process 29644 detached
Process 29643 detached


Any other ideas?
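
A minimal sketch, assuming the strace output has been saved to a file with one record per line, of how one might scan it for fork-family calls that returned an error. fork() normally fails with EAGAIN (a process or resource limit was hit) or ENOMEM, so the errno name in the trace narrows down the cause. The function name and the regular expression are illustrative and not part of nagios or strace.

```python
import re
from collections import Counter

# Matches a failed clone/clone3/fork/vfork record, either written on one line
# or as the "<... clone resumed>" half of an interrupted call.
FAILED_FORK = re.compile(
    r"(?:\b(?:clone3?|fork|vfork)\(|<\.\.\. (?:clone3?|fork|vfork) resumed>)"
    r".*=\s*-1\s+(E[A-Z]+)"
)


def count_fork_errors(lines):
    """Return a Counter mapping errno names to the number of failed forks."""
    errors = Counter()
    for line in lines:
        match = FAILED_FORK.search(line)
        if match:
            errors[match.group(1)] += 1
    return errors


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python fork_errors.py /tmp/strace.log
    with open(sys.argv[1]) as log:
        for errno_name, count in count_fork_errors(log).items():
            print(f"{errno_name}: {count}")
```

If the log shows no failed calls at all, the trace probably ended before the problem occurred, for example because nagios daemonized and strace was not following the surviving process.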

On 9/2/05, Fred <f1216 at yahoo.com> wrote:
> 
> Just for fun, you might try creating the problem and see how many forks
> you *can* get, for example:
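>  #!/usr/bin/perl
> use strict;
> use warnings;
> 
> $| = 1;  # unbuffer STDOUT so the counter updates immediately
> my $c = 0;
> while (1) {
>     my $pid = fork();
>     if (!defined $pid) {
>         # fork() failed: report how far we got and why, then stop
>         print "\nfork failed after $c children: $!\n";
>         exit(1);
>     }
>     if ($pid) {
>         # parent: count the child and keep forking
>         $c++;
>         print "\rchildcount $c ";
>     }
>     else {
>         # child: exit quickly; it stays defunct because nobody waits for it
>         sleep(1);
>         exit(0);
>     }
> }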
> to create as many procs as you can and test your limit. You would
> want to do this under the same environment as the nagios process
> runs.
>  The children will all be kept defunct (zombies) until the parent process
> exits, which happens when you hit the max processes you can create.
>  The other thing you might try is to start nagios under
> strace -f and output the data to a log. You can restrict
> strace to process-related calls such as forks, i.e.,
> strace -f -e trace=process nagios .... >/tmp/log 2>&1
>  That would give you a good handle on what is going on when the failure
> occurs. Might slow nagios down a bit, but probably nothing significant.
>  -FredC
>    
> 
> *Terry <td3201 at gmail.com>* wrote:
> 
> I have a program that checks the logs by the minute and pages when the
> fork errors occur, so we are responding within minutes. I have looked
> at the resources every time it happens and we have plenty of
> resources. Is there a single plugin I can put into debugging mode so
> that when this happens I get more information as to why it is giving
> these errors? Here are a few facts:
> - the system is fine with memory all the time, never runs out 
> (resident/paging)
> - there is not an unusual number of processes running, maybe around
> 200 at a time, nowhere near the ulimit setting
> - ulimit for the 'nagios' user matches that of root (unlimited). Here
> is the ulimit output:
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> file size               (blocks, -f) unlimited
> max locked memory       (kbytes, -l) 4
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 1024
> pipe size            (512 bytes, -p) 8
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 7168
> virtual memory          (kbytes, -v) unlimited
> 
> Thanks,
> Terry
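
The ulimit listing above comes from a login shell, and a daemon started from an init script may inherit different limits. As a rough sketch, assuming Python's standard `resource` module on Linux, the following reports the limits of whatever process runs it; to be meaningful it would have to be launched as the nagios user from the same environment that starts nagios. The function names are illustrative only.

```python
import resource


def describe_limit(name, rlimit):
    """Format the soft and hard values of one resource limit."""
    soft, hard = resource.getrlimit(rlimit)

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return f"{name}: soft={fmt(soft)} hard={fmt(hard)}"


def fork_related_limits():
    """Return the limits most likely to make fork() fail."""
    return [
        describe_limit("max user processes", resource.RLIMIT_NPROC),
        describe_limit("open files", resource.RLIMIT_NOFILE),
        describe_limit("virtual memory", resource.RLIMIT_AS),
    ]


if __name__ == "__main__":
    for line in fork_related_limits():
        print(line)
```

The soft value is the one actually enforced, so a soft limit lower than the hard limit shown by `ulimit -H` can still cause fork() to fail.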
> 
> 
> 
> On 9/1/05, Fred wrote:
> > My guess would be to look at the resource utilization on your system.
> > The most likely causes for fork() to fail are no more process slots, out of
> > memory, or exceeding some kind of per-user (non-root) limit. When this
> > occurs look at your system logs, ps output and see if you have *lots*
> > of processes hanging around. It could be that nagios has stopped reaping
> > its children (or another unrelated process has sucked up the resources)
> > and you have simply pushed your system to the edge. It might be that you
> > get to that situation and it backs off before you even notice it and you
> > are left with nagios having problems dealing with the aftermath.
> > 
> > -FredC
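
To check the "stopped reaping its children" theory, one could count defunct processes per parent at the moment the errors appear. This is a sketch under the assumption that a procps-style `ps` is available and that `ps -eo pid,ppid,stat,comm` prints a header line followed by one process per row; the helper names are made up for illustration.

```python
import subprocess
from collections import Counter


def zombies_by_parent(ps_output):
    """Count zombie (state Z) processes per parent PID in ps output."""
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)         # pid, ppid, stat, command
        if len(fields) >= 3 and fields[2].startswith("Z"):
            counts[int(fields[1])] += 1
    return counts


def current_zombies():
    """Run ps and return the zombie count per parent PID."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return zombies_by_parent(result.stdout)


if __name__ == "__main__":
    for ppid, count in current_zombies().most_common():
        print(f"parent {ppid}: {count} defunct children")
```

A large count under the nagios PID would point at unreaped children; a count near zero suggests looking at limits or memory instead.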
> > 
> > --- Terry wrote:
> > 
> > > Hello,
> > >
> > > I have been having this issue for quite some time. For some unknown
> > > reason, nagios stops performing checks with these errors:
> > >
> > > [1125536952] Warning: The check of service 'PING' on host 'hostname'
> > > could not be performed due to a fork() error. The check will be
> > > rescheduled.
> > >
> > > All checks fail like this until nagios is restarted. When this
> > > problem is occurring I can run the service checks manually both as the
> > > nagios user and as the root user. There are no resource problems that
> > > I can see at the time. We do not appear to be hitting a limit with
> > > open files or anything like that either. The nagios user mirrors the root
> > > user in that area.
> > >
> > > What could be wrong?
> > >
> > > Thanks!
> > >
> > >
> > >
> > 
> > 
> > 
> > 
> > 
> >
> 
> 
> 
>