Nagios 3.0 hanging (10/19 CVS)

Steffen Poulsen step at tdc.dk
Thu Oct 25 13:16:44 CEST 2007


Hi,

Just to add our 20 cents: this sounds very much like the problems we have been experiencing. The latest thread is "3.0b5: External commands are not turned into passive checks after a while" from the 15th of October.

As described in that mail, we are also seeing irregular memory usage, and if nagios runs long enough it will steal all available file descriptors, leaving us with:

root@<nagios server>:/usr/local/nagios/bin# /bin/echo "test"
bash: fork: Resource temporarily unavailable

- Until we can squeeze in a pkill nagios or similar.
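
One way to catch this before the box becomes unusable is to watch the daemon's open descriptor count. The following is only a minimal sketch, assuming a Linux host with /proc mounted; the function names and the threshold value are made up for illustration and are not part of Nagios.

```python
import os


def count_open_fds(pid="self"):
    """Return the number of entries in /proc/<pid>/fd (Linux only).

    When counting our own process, the directory listing itself briefly
    holds one descriptor, so the result can be one higher than expected.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage_is_suspicious(pid="self", threshold=900):
    """Hypothetical check: True once the count reaches the threshold.

    The default of 900 is an arbitrary example chosen to sit below a
    typical soft limit of 1024; pick a value that matches your ulimit -n.
    """
    return count_open_fds(pid) >= threshold
```

Run from cron against the nagios PID, a check like this would give some warning before the descriptor table is exhausted. Reading another user's /proc/<pid>/fd normally requires root or the same UID.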

We are also running with embedded Perl. We have just compiled a new version without it and will try that one out (latest beta, 3.0b5, not SVN).

Best regards,
Steffen Poulsen


> -----Original message-----
> From: nagios-devel-bounces at lists.sourceforge.net
> [mailto:nagios-devel-bounces at lists.sourceforge.net] On behalf of
> Andreas Ericsson
> Sent: 22 October 2007 17:01
> To: Nagios Developers List
> Subject: Re: [Nagios-devel] Nagios 3.0 hanging (10/19 CVS)
> 
> Shad L. Lords wrote:
> > I've had a few instances where nagios will be running but will fail to
> > run checks or process anything.  I noticed it this morning and did a
> > quick strace of the process to see what it was trying to do (see
> > below).  I hope this will be of use to someone.
> > 
> 
> It is indeed. Thanks a lot.
> 
> > open("/var/spool/nagios", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY)
> > = -1 EMFILE (Too many open files)
> > open("/var/log/nagios/nagios.log", O_RDWR|O_CREAT|O_APPEND|O_LARGEFILE,
> > 0666) = -1 EMFILE (Too many open files)
> 
> 
> Here is the primary symptom of the problem, methinks. EMFILE 
> is a pretty unusual error. There's probably some (or a lot) 
> of codepaths in Nagios where the check result files aren't 
> closed properly, leading to all sorts of weird errors ...
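
To illustrate the kind of codepath suspected here, the sketch below shows how an early return can skip a close, and how try/finally avoids it. It is written in Python with raw descriptors so that it behaves like C code would; the function names are hypothetical and this is not taken from the Nagios source.

```python
import os


def first_line_leaky(path):
    """Read the first line of a file, with a deliberate descriptor leak."""
    fd = os.open(path, os.O_RDONLY)
    data = os.read(fd, 4096)
    if not data:
        return None  # BUG: this early return never reaches os.close(fd)
    os.close(fd)
    return data.split(b"\n", 1)[0]


def first_line_safe(path):
    """Same logic, but the descriptor is closed on every exit path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        data = os.read(fd, 4096)
        if not data:
            return None
        return data.split(b"\n", 1)[0]
    finally:
        os.close(fd)
```

Each call to the leaky variant on an empty file costs one descriptor for the life of the process, so a daemon that processes result files continuously would eventually hit EMFILE.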
> 
> > clone(child_stack=0, 
> > flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > child_tidptr=0xb7fe3708) = -1 ENOMEM (Cannot allocate memory)
> 
> ... and eventually it runs into the good ole ENOMEM. I'm 
> guessing this happens because the scheduling queue keeps 
> filling up more or less indefinitely, and the child processes 
> keep stacking up as well.
> 
> Personally, I think the only sane thing to do when you get 
> ENOMEM is, in the absence of garbage collectors to run, to 
> just die as gracefully as possible with a loud, loud error 
> message in the logs, and possibly leaving a core dump. 
> kill(0, SIGSEGV) can accomplish that last thing.
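
The "die loudly on ENOMEM" idea could look roughly like the following. This is a sketch under stated assumptions, not what Nagios does: it uses os.abort() (SIGABRT to the calling process only) in place of the kill(0, SIGSEGV) suggested above, since kill with pid 0 signals every process in the caller's process group. The function name and logger name are hypothetical, and the fork and abort parameters exist only so the behaviour can be exercised without actually crashing.

```python
import errno
import logging
import os

log = logging.getLogger("scheduler")  # hypothetical logger name


def fork_or_die(fork=os.fork, abort=os.abort):
    """Fork a child; on ENOMEM log a critical message and abort.

    os.abort() raises SIGABRT in this process, which leaves a core dump
    if core dumps are enabled. Any other fork error is re-raised so the
    caller can decide what to do with it.
    """
    try:
        return fork()
    except OSError as exc:
        if exc.errno == errno.ENOMEM:
            log.critical("fork failed with ENOMEM (out of memory); aborting")
            abort()
        raise
```

Whether a core dump is actually written depends on the system's core size limit (ulimit -c) and core pattern settings.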
> 
> I won't have time to dig into this until tomorrow, but with 
> Ethan blazing through the codebase he'd probably have it 
> fixed before me anyway. :)
> 
> -- 
> Andreas Ericsson                   andreas.ericsson at op5.se
> OP5 AB                             www.op5.se
> Tel: +46 8-230225                  Fax: +46 8-230231
> 
> -------------------------------------------------------------------------
> This SF.net email is sponsored by: Splunk Inc.
> Still grepping through log files to find problems?  Stop.
> Now Search log events and configuration files using AJAX and a browser.
> Download your FREE copy of Splunk now >> http://get.splunk.com/
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
> 
