bizarre Nagios 2.12 memory leak

Jeremy s6a9d6u9s at gmail.com
Thu Apr 15 17:24:20 CEST 2010


We have a large distributed setup running Nagios 2.12 with 20 distributed
servers sharing about 20000 checks against 2500 hosts. They are reporting
into multiple master Nagios servers using a modified OCP_daemon that handles
multiple master servers. Recently we nearly doubled our number of
distributed servers. Our number of checks had grown so much that some of our
busiest distributed servers were only completing about 20-30% of their checks
per minute. Now they are completing 90% per minute.

Ever since we increased the frequency of all the checks, our oldest master
server has started crashing at random intervals. Nothing else has
changed. Memory use goes through the roof until eventually there is 0 swap
left and the server finally crashes and has to be rebooted. If we restart
the Nagios service while the memory usage is going crazy, it drops back down
to normal for quite a while, but days later it will happen again. I started
restarting Nagios on that server once an hour but it hasn't helped. We tried
upgrading to 16 GB of RAM which has made this happen a bit less often, but
it continues to happen sometimes.

We are using NPCD to graph the performance data from all of our checks, but
all the graph .RRD files are on a dedicated partition, and the crashing
happens even when we disable graphing completely and disk I/O is near 0% on
both the system partitions and the graph partition.

So I was wondering how I could go about figuring out why Nagios is freaking
out on our older server (Dell PowerEdge 1950). Our other Master server (a
Dell PowerEdge R710) gets all the same checks reported to it, and handles it
just fine, but it is using much newer Xeon CPUs, faster memory, etc. The old
crashing server handles things just fine for days at a time until it
randomly runs itself out of swap space and crashes.
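
As a starting point for narrowing this down, here is a minimal sketch of a
logger that records how many Nagios processes exist and how much memory each
one holds, so the log from just before a crash would show whether a single
process is growing or whether processes are piling up. It reads the standard
Linux /proc/<pid>/status files. The function names (parse_status, sample,
log_samples) and the process name "nagios" are assumptions for illustration,
not anything Nagios provides, and the VmSwap field is only present on newer
kernels, so it is reported as 0 when missing.

```python
import os
import time


def parse_status(text):
    """Return (name, rss_kb, swap_kb) from the text of /proc/<pid>/status."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    def kb(key):
        # Values look like "123456 kB"; kernel threads and older kernels
        # may omit the field entirely, so fall back to 0.
        parts = fields.get(key, "0 kB").split()
        return int(parts[0]) if parts else 0

    return fields.get("Name", ""), kb("VmRSS"), kb("VmSwap")


def sample(process_name="nagios"):
    """Return {pid: (rss_kb, swap_kb)} for every process with that name."""
    result = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join("/proc", entry, "status")) as handle:
                name, rss, swap = parse_status(handle.read())
        except OSError:
            continue  # process exited between listdir() and open()
        if name == process_name:
            result[int(entry)] = (rss, swap)
    return result


def log_samples(count, interval_seconds=60, process_name="nagios"):
    """Print `count` snapshots, one every `interval_seconds` seconds."""
    for index in range(count):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        procs = sample(process_name)
        total_rss = sum(rss for rss, _ in procs.values())
        print(f"{stamp} procs={len(procs)} total_rss_kb={total_rss}", flush=True)
        for pid, (rss, swap) in sorted(procs.items()):
            print(f"{stamp}   pid={pid} rss_kb={rss} swap_kb={swap}", flush=True)
        if index + 1 < count:
            time.sleep(interval_seconds)
```

Calling log_samples(1440) with output redirected to a file on a partition
that survives the reboot would cover roughly a day at one sample per minute.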

I know I really should get around to upgrading to Nagios 3.x but no time for
that yet and it's going to be a pain to upgrade them all at once without
being blind for a little bit, so pretend Nagios 3.x isn't an option just
yet.

Thanks for any insight!
Jeremy