Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)

Andreas Ericsson ae at op5.se
Mon Nov 28 17:09:58 CET 2005


linux-system-technik at de.man-mn.com wrote:
> Hi everybody,
> 
> unfortunately nobody answered Alex from viveconsulting.co.nz, who had a
> problem with "Nagios spawning rogue ..." and mailed the Nagios mailing list
> some months ago.


A link to the mail archives would be helpful.


> Right now, we very likely have the same problem he described in a very
> detailed way. I also tried a lot of different things (from configuration
> changes to tuning) to find out the real problem, and I guess the real
> bottleneck is the pipe used for communication between Nagios processes.


Most likely. It's the only real bottleneck in nagios today, so...
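
To make the limit concrete: a pipe holds only a fixed amount of unread data, and once it is full every writer has to wait for the reader. The sketch below is plain Python, not Nagios code, and the name pipe_capacity is made up for illustration. It fills an empty pipe in non-blocking mode and counts how many bytes fit before a write would block. The number depends on the operating system and kernel version, so none is quoted here.

```python
import os


def pipe_capacity() -> int:
    """Return how many bytes fit in an empty pipe before a write would block.

    Hypothetical helper for illustration only; not part of Nagios.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode: a full pipe raises BlockingIOError instead of hanging.
    os.set_blocking(write_fd, False)
    chunk = b"x" * 1024
    total = 0
    try:
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                break  # pipe is full; a blocking writer would stall here
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return total


if __name__ == "__main__":
    print("pipe filled after", pipe_capacity(), "bytes")
```

With blocking writes (the default), every process writing to a full pipe simply waits at that point until the reader drains it, which would fit the pile-up of hanging processes described in this thread.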


> But I did not find many reports (e.g. emails) about this problem on the
> web or in the mail archives.
> 
> So why am I writing to the list? Maybe someone can give me a hint on how to
> solve or work around that problem? We have 677 services configured and use
> 350 RRDs. Our Nagios CMS is a PIII 866 MHz with SCSI RAID 5. The system load
> is a little bit more than 1.00. As long as we stay below 1.00 no problem, but
> otherwise ... (Detailed problem description in Alex's mail)
> 

CMS? Content Management System?
Anyways, 677 services shouldn't be a problem.


> This is just our start with Nagios. We want to configure thousands of
> services and more than 100 hosts. We would also invest in faster
> hardware, dual CPU, 2GB memory and faster SCSI HDDs but is faster hardware
> an option?

It helps, but not very much I'm afraid. The bottleneck requires a kernel 
recompile to be solved on most systems, and that's a very bad thing to 
do just to fix this particular problem.

> Looking at this issue with the focus on implementation: If the
> pipe is the bottleneck it will stay a bottleneck on faster hardware too.
> But maybe faster hardware will allow us to configure 3000 services, which
> would be enough for the Nagios instance. And then, we deploy another Nagios
> instance ...
> 

This is definitely a solution. Otherwise you could keep your eyes open 
in the somewhat near future for a mail with

[PATCH] checks: Multiplex running checks.

in the subject. I'm working on it right now, but perhaps Ethan won't let 
it in for the 2.x branch since it's a fairly massive change.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

