FW: Problems with distributed setup, master overload?

Jeffrey Lensen jeffrey at hyves.nl
Wed Jun 13 15:39:52 CEST 2007


I'm not sure adding more slaves will solve the problem on the master
server. From what I've been reading on these mailing lists, the problem
tends to be the nagios.cmd pipe filling up.
Multiple processes try to write to it (NSCA children) and multiple
processes try to read from it (Nagios children). From what I understand,
the NSCA children all try to access the pipe, can't, and block. NSCA
then creates another child process, which also blocks, and so on.
Within an hour the system started swapping because of the large number
of processes (I think I actually saw it going up to 6500!) and
eventually it died.
In my setup this problem only started showing after I added 3 slave
hosts, so I'm not sure adding more will fix anything.
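
To make the "pipe fills up" part concrete, here is a minimal sketch of
what happens when nobody drains a named pipe. It is not Nagios or NSCA
code: the FIFO name demo.cmd and the function fill_fifo are made up for
the demo, and it assumes a POSIX system. The write end is opened
non-blocking so the demo returns instead of hanging; a writer using
ordinary blocking writes would simply sit and wait at the same point.

```python
import os
import tempfile


def fill_fifo(chunk: bytes = b"x" * 4096) -> int:
    """Write to a FIFO that nobody drains; return bytes accepted before it is full."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical stand-in for a command pipe such as nagios.cmd.
        path = os.path.join(tmp, "demo.cmd")
        os.mkfifo(path)
        # Open a read end first so the write end can be opened, but never read from it.
        rfd = os.open(path, os.O_RDONLY | os.O_NONBLOCK)
        wfd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
        written = 0
        try:
            while True:
                try:
                    written += os.write(wfd, chunk)
                except BlockingIOError:
                    # The kernel buffer is full. A blocking writer would
                    # wait here until a reader frees up some space.
                    break
        finally:
            os.close(wfd)
            os.close(rfd)
        return written


if __name__ == "__main__":
    print("bytes accepted before the pipe was full:", fill_fifo())
```

The number printed depends on the kernel's pipe buffer size, so treat it
as an illustration only. The point is that the buffer is finite, and
every extra writer beyond that has to wait for the reader.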

My current (temporary) solution is a small script in the crontab, which
restarts the NSCA daemon every 30 minutes or so. If it can't stop
normally, it does a killall -9, and then forces another start. Seems to
hold up so far, but the solution isn't pretty.
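
For anyone who wants to try the same workaround, the logic is roughly
the sketch below. This is not my actual script, just the same idea
written out in Python. The init script path /etc/init.d/nsca, the
process name nsca and the 10 second grace period are assumptions, so
adjust them to your own installation.

```python
import subprocess
import time


def restart_daemon(stop_cmd, kill_cmd, start_cmd, is_running,
                   grace_seconds=10.0, run=subprocess.call):
    """Stop a daemon, force-kill it if it will not stop, then start it.

    Returns True if the force-kill step was needed.
    """
    run(stop_cmd)
    # Give the daemon a little time to shut down on its own.
    deadline = time.monotonic() + grace_seconds
    while is_running() and time.monotonic() < deadline:
        time.sleep(0.5)
    forced = is_running()
    if forced:
        run(kill_cmd)
    run(start_cmd)
    return forced


def nsca_is_running():
    # pgrep exits with status 0 when at least one process matches.
    return subprocess.call(["pgrep", "-x", "nsca"],
                           stdout=subprocess.DEVNULL) == 0


def restart_nsca():
    # Assumed commands; the init script path is hypothetical.
    return restart_daemon(
        stop_cmd=["/etc/init.d/nsca", "stop"],
        kill_cmd=["killall", "-9", "nsca"],
        start_cmd=["/etc/init.d/nsca", "start"],
        is_running=nsca_is_running,
    )
```

Calling restart_nsca() from a script that cron runs every 30 minutes
gives the behaviour described above. The commands are passed in as
parameters so the logic can be tried out with harmless stand-ins before
pointing it at a real daemon.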

Jeffrey Lensen
System Administrator Hyves
hyves page: http://skyler.hyves.nl
mail/msn:   jeffrey at hyves.nl



Wheeler, JF (Jonathan) wrote:
> -----Original Message-----
> From: nagios-users On Behalf Of Jeffrey Lensen
> Sent: 10 June 2007 08:28
>
>> I recently extended our distributed Nagios setup from 1 master and 2
>> distributed slaves (in which the master also had a lot of checks
>> running) to 1 master and 5 distributed slaves (in which the master
>> does no checking at all, except for host checks).
>>
>> This setup had 556 hosts and roughly 7000 service checks. Ever since I
>> modified this setup, the Nagios master host has been giving me
>> problems.
>>
>> The symptoms:
>> - When starting both Nagios and NSCA, I see NSCA accepting checks in
>>   my logfiles, but none get processed by Nagios.
>> - After a few minutes NSCA processes start to build up, increasing by
>>   5-10 processes per second. In a few minutes it reaches a few
>>   thousand processes and the machine starts hanging.
>> - Sometimes the number of Nagios processes starts increasing instead
>>   of the NSCA processes. Same result: the machine starts hanging.
>
> I have seen similar problems, though in my case (1 master, 2 slaves, 824
> hosts, 16000+ services) the queued NSCA processes are eventually
> flushed.  However the Nagios master server also suffers from memory
> leaks; it eventually (after a period of 1 - 5 days) either crashes with
> a kernel panic because there is no free memory, or reaches a state where
> the kernel has killed all useful processes (e.g. nagios, nsca, sshd,
> ntpd, etc.) in an attempt to cure OOM (Out Of Memory) problems.
> Interestingly, running strace on the first child nsca process seems to
> bring everything back to life, and the queue of NSCA processes quickly
> flushes.
>
> I have tried running nagios using option -s to get configuration
> recommendations and nagiostats to get usage information on both master
> and slave servers, but they do not reveal anything useful.  My current
> plan is to introduce 3 more slave servers as I have heard that this
> helps.
>
> Any comments would be helpful to me as well.
>
> Jonathan Wheeler
> e-Science Centre
> Rutherford Appleton Laboratory
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null
>   