bizarre Nagios 2.12 memory leak

Giorgio Zarrelli zarrelli at linux.it
Thu Apr 15 17:43:22 CEST 2010


Did you check for growth in zombie processes and for iowait on the CPU?
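
For what it's worth, here is a minimal Python sketch of that check, assuming a Linux host with /proc mounted. The function names (parse_state, count_zombies, cpu_times, iowait_fraction) are made up for illustration and are not part of Nagios. The idea is to sample both values periodically and see whether either climbs before memory use takes off.

```python
import os


def parse_state(stat_text):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the LAST ')' before reading the state.
    return stat_text.rsplit(")", 1)[1].split()[0]


def count_zombies(proc_root="/proc"):
    """Count processes currently in state 'Z' (zombie)."""
    zombies = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                if parse_state(fh.read()) == "Z":
                    zombies += 1
        except (OSError, IndexError):
            continue  # process vanished or stat line was unreadable
    return zombies


def cpu_times(cpu_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints."""
    fields = cpu_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user nice system idle iowait irq softirq steal; guest time is already
    # counted inside user, so stop after the first eight values.
    return [int(v) for v in fields[1:9]]


def iowait_fraction(before_line, after_line):
    """Share of CPU time spent in iowait between two 'cpu' samples."""
    before, after = cpu_times(before_line), cpu_times(after_line)
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    # iowait is the fifth value on the line (index 4).
    return deltas[4] / total if total > 0 else 0.0
```

To use it, read the first line of /proc/stat twice a few seconds apart, pass both lines to iowait_fraction, and log the result next to count_zombies() from cron or a small loop.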

Ciao,

Giorgio

On 15 Apr 2010, at 17:24, Jeremy <s6a9d6u9s at gmail.com> wrote:

> We have a large distributed setup running Nagios 2.12 with 20  
> distributed servers sharing about 20000 checks against 2500 hosts.  
> They are reporting into multiple master Nagios servers using a  
> modified OCP_daemon that handles multiple master servers. Recently  
> we nearly doubled our number of distributed servers. Our number of
> checks had grown so much that we were only completing about 20-30%
> of them per minute on some of our busiest distributed servers. Now
> we are completing 90% per minute.
>
> Ever since we increased the frequency of all the checks, our oldest  
> Master server has started crashing randomly every so often. Nothing  
> else has changed. Memory use goes through the roof until eventually  
> there is 0 swap left and the server finally crashes and has to be  
> rebooted. If we restart the Nagios service while the memory usage is  
> going crazy, it drops back down to normal for quite a while, but  
> days later it will happen again. I started restarting Nagios on that  
> server once an hour but it hasn't helped. We tried upgrading to 16  
> GB of RAM which has made this happen a bit less often, but it  
> continues to happen sometimes.
>
> We are using NPCD to graph the performance data from all of our  
> checks, but all the graph .RRD files are on a dedicated partition,  
> and the crashing happens even when we disable graphing completely  
> and disk I/O is near 0% on both the system partitions and the graph  
> partition.
>
> So I was wondering how I could go about figuring out why Nagios is  
> freaking out on our older server (Dell PowerEdge 1950). Our other  
> Master server (a Dell PowerEdge R710) gets all the same checks  
> reported to it and handles them just fine, but it uses much newer
> Xeon CPUs, faster memory, etc. The old crashing server handles  
> things just fine for days at a time until it randomly runs itself  
> out of swap space and crashes.
>
> I know I really should get around to upgrading to Nagios 3.x but no  
> time for that yet and it's going to be a pain to upgrade them all at  
> once without being blind for a little bit, so pretend Nagios 3.x  
> isn't an option just yet.
>
> Thanks for any insight!
> Jeremy

------------------------------------------------------------------------------
Download Intel® Parallel Studio Eval
Try the new software tools for yourself. Speed compiling, find bugs
proactively, and fine-tune applications for parallel performance.
See why Intel Parallel Studio got high marks during beta.
http://p.sf.net/sfu/intel-sw-dev
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




