Number of Nagios Processes Distributed Monitoring

Mooney, Ryan ryan.mooney at pnl.gov
Fri Jul 25 18:56:24 CEST 2003


I had a similar problem when doing lots of external checks.  The subprocess that
gets forked to read the results from the .cmd pipe and then write them to the shared
fd to the master process would block (forever) on the write call.  I never did figure
out why, since the code appeared to be correct.  I ended up putting an alarm around
the write call and timing it out if it hung too long.  I figured that losing a few
passive checks was better than having memory fill up and the machine die.  Based on
the behavior I saw, I'm not really convinced that the problem is 100% limited to the
passive checks though, as a very similar set of routines is used by the active checks
code.

If you compile nagios with debugging symbols (export "CFLAGS=-g"; ./configure
--whatever-options-you-use; make; make install) and then watch the "ps aux" output,
you'll notice that there is one really long-running process that takes a fair bit of
CPU (which is the good master).  Over time you'll start seeing some other processes
that have a start time a fair bit in the past and never die.  Attach to one of these
with a debugger (say "cd /wherever/you/compiled/nagios/; gdb base/nagios [pid]", where
[pid] is the process ID of one of the processes with a start time > 1hr ago that is
not the master process) and do a "bt" to get a call trace out of it.  That would
likely help determine where the processes are getting stuck.

If you are having the same problem I was, you will likely see
"process_passive_service_checks" and/or "check_for_external_commands" in the call
trace.  Sometimes the stack looks munged, so the call stack may not be 100% accurate.
That leads me to believe that some corruption is what's causing the write to hang, but
I wasn't able to figure out easily what was causing the corruption and had to "get
things working".

I'd be curious to see if it's the same problem.

> >Jasmine 
> I am pretty sure it was not nagios itself; rather, memory ran out and the
> server stalled.
> At the moment I have a nagios uptime of : 
> 
> Total Running Time: 0d 6h 6m 15s 
> And this... 
> Check Command Output:  Nagios ok: located 1677 processes, status log
> updated 170 seconds ago   
> 
> I am pretty sure this is not ok.
> 
> Any Ideas ? 
> 
> I will let the server run over the weekend; when it crashes again, I
> will give detailed information to the list.


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




