<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> <html> <head> <meta content="text/html;charset=ISO-8859-1" http-equiv="Content-Type"> </head> <body bgcolor="#ffffff" text="#000000"> I'm not sure adding more slaves will solve the problem on the master server. From what I've been reading on these mailing lists, the problem tends to be the nagios.cmd pipe filling up. Multiple processes try to write to it (NSCA children) and multiple processes try to read from it (Nagios children). From what I understand, the NSCA daemons all try to access the pipe, can't, and start to wait. NSCA then creates another child process, which also starts to wait, and so on. Within an hour the system started swapping because of the large number of processes (I think I actually saw it go up to 6500!) and eventually it died. In my setup this problem started showing up after I added 3 slave hosts, so I'm not sure adding more will fix anything. My current (temporary) solution is a small script in the crontab, which restarts the NSCA daemon every 30 minutes or so. If it can't stop the daemon normally, it does a killall -9 and then forces another start. It seems to hold up so far, but the solution isn't pretty. 
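<br> <br> A minimal sketch of the restart logic described above, in Python. This is not the script actually in use: the init script path <code>/etc/init.d/nsca</code>, the process name <code>nsca</code>, and the use of the stop command's exit status to decide whether the daemon "stopped normally" are all assumptions made for illustration.
<pre wrap="">
```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical commands: adjust to the local init script and process name.
STOP_CMD = ["/etc/init.d/nsca", "stop"]
KILL_CMD = ["killall", "-9", "nsca"]
START_CMD = ["/etc/init.d/nsca", "start"]


def run_command(cmd: Sequence[str]) -> int:
    """Run a command and return its exit status (127 if it cannot be run)."""
    try:
        return subprocess.call(list(cmd))
    except OSError:
        return 127


def restart_nsca(run: Callable[[Sequence[str]], int] = run_command) -> List[str]:
    """Stop NSCA, force-kill it if the normal stop fails, then start it again.

    Returns the names of the steps taken, which is handy for logging.
    Assumption: a non-zero exit status from the stop command means the
    daemon did not stop normally.
    """
    steps = ["stop"]
    if run(STOP_CMD) != 0:
        # Normal stop failed: fall back to killall -9.
        steps.append("kill")
        run(KILL_CMD)
    steps.append("start")
    run(START_CMD)
    return steps
```
</pre>
Calling <code>restart_nsca()</code> from a small script run by cron every 30 minutes would mirror the schedule described above. The command runner is passed in as a parameter so the logic can be tried out without touching a live daemon. 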
<pre class="moz-signature" cols="72">Jeffrey Lensen
System Administrator
Hyves
hyves page: <a class="moz-txt-link-freetext" href="http://skyler.hyves.nl">http://skyler.hyves.nl</a>
mail/msn: <a class="moz-txt-link-abbreviated" href="mailto:jeffrey@hyves.nl">jeffrey@hyves.nl</a> </pre> Wheeler, JF (Jonathan) wrote: <blockquote cite="mid:F93ED76B6830FB4CB81262937940F726013B02F6@exchange11.fed.cclrc.ac.uk" type="cite"> <pre wrap="">-----Original Message-----
From: nagios-users On Behalf Of Jeffrey Lensen
Sent: 10 June 2007 08:28 </pre> <blockquote type="cite"> <pre wrap="">I recently extended our distributed Nagios setup of 1 master and 2 distributed slaves (in which the master also had a lot of checks running), to 1 master and 5 distributed slaves (in which the master does no checking at all, except for host checks). This setup had 556 hosts and roughly 7000 service checks. Ever since I modified this setup, the Nagios master host has been giving me problems.

The symptoms:
- When starting both Nagios and NSCA, I see NSCA accepting checks in my logfiles, but none get processed by Nagios.
- After a few minutes NSCA processes start to build up, increasing by 5-10 processes per second. In a few minutes it reaches a few thousand processes and the machine starts hanging.
- Sometimes the number of Nagios processes starts increasing, instead of the NSCA processes. Same result, the machine starts hanging. 
</pre> </blockquote> <pre wrap=""> I have seen similar problems, though in my case (1 master, 2 slaves, 824 hosts, 16000+ services) the queued NSCA processes are eventually flushed. However, the Nagios master server also suffers from memory leaks; eventually (after a period of 1 - 5 days) it either crashes with a kernel panic because there is no free memory, or reaches a state where the kernel has killed all useful processes (e.g. nagios, nsca, sshd, ntpd, etc.) in an attempt to cure OOM (Out Of Memory) problems. Interestingly, trying to strace the first daughter nsca process seems to bring everything back to life, and the queue of NSCA processes quickly flushes.

I have tried running nagios with option -s to get configuration recommendations, and nagiostats to get usage information on both master and slave servers, but they do not reveal anything useful. My current plan is to introduce 3 more slave servers, as I have heard that this helps. Any comments would be helpful to me as well.

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory </pre> </blockquote> </body> </html>