Number of Nagios Processes Distributed Monitoring

Mooney, Ryan ryan.mooney at pnl.gov
Mon Aug 25 18:05:55 CEST 2003


No official fix that I've seen, although I haven't been tracking CVS.

At least one other solution was proposed. It may be a "better" solution, I don't
know (both are a bit ugly IMHO, but hey, you can't argue too hard w/ success).

Jay 'Whip' Grizzard [elfchief at lupine.org] explained the problem thusly:

	After much investigation, the best conclusion I've been able to draw is that
	the process scheduling on our system (RedHat 8.0) is behaving in such a way
	that, after the parent process is able to clear out its pipe a bit (thus
	freeing some buffer), some processes are always scheduled -last-, giving
	other (newer) processes a chance to fill up the pipe's buffer again before
	the 'hung' processes get a chance to run again.

	Since I didn't feel like rewriting the Linux process scheduler, I instead
	opted to increase the size of the pipe buffer in the kernel (there's a 
	define for PIPE_SIZE that's normally set to PAGE_SIZE -- 4k on x86). I 
	increased it to (8 * PAGE_SIZE) and rebuilt my kernel, under the theory
	that a larger buffer would give processes a much larger chance of being
	able to get some data into the buffer before it filled ... and, indeed, 
	the 'hung' processes seem to have gone away -- After 24 hours, the oldest
	nagios subprocesses on the box are, at worst, one minute old.
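
To see how much room a pipe gives writers on a particular box, here is a minimal
Python sketch (not Nagios code; the function name measure_pipe_capacity is made
up for illustration). It fills a fresh pipe with non-blocking writes and counts
the bytes accepted before a write would block. The number depends on the kernel,
so treat it as a measurement, not a constant.

```python
import os


def measure_pipe_capacity(chunk_size=512):
    """Return how many bytes a fresh pipe accepts before a write would block."""
    read_fd, write_fd = os.pipe()
    try:
        # Non-blocking mode: a full pipe raises BlockingIOError instead of hanging.
        os.set_blocking(write_fd, False)
        chunk = b"x" * chunk_size
        total = 0
        while True:
            try:
                total += os.write(write_fd, chunk)
            except BlockingIOError:
                # Nobody is reading, so the buffer is now full.
                return total
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("pipe capacity in bytes:", measure_pipe_capacity())
```

With many children writing into one pipe and a single reader draining it, that
capacity is all the slack the writers have before they start blocking.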

> -----Original Message-----
> From: Mike Benoit [mailto:mikeb at netnation.com]
> Sent: Monday, August 25, 2003 8:59 AM
> To: Mooney, Ryan
> Cc: nagios-users; nagios-users at lists.sourceforge.net
> Subject: RE: [Nagios-users] Number of Nagios Processes Distributed
> Monitoring
> 
> 
> I'm having the exact same problem with Nagios 1.1. There hasn't been any
> official fix for this released yet, correct? It sure makes using passive
> checks difficult. :(
> 
> On Fri, 2003-07-25 at 09:56, Mooney, Ryan wrote:
> > I had a similar problem when doing lots of external checks. The subprocess that
> > gets forked to read the results from the .cmd pipe and then write them to the shared
> > fd to the master process would block (forever) on the write call. I never did figure
> > out why, since the code appeared to be correct. I ended up putting an alarm around
> > the write call and timing it out if it hung too long. I figured that losing a few
> > passive checks was worth not having memory fill up & having the machine die. Based on
> > the behavior I saw, I'm not really convinced that the problem is 100% limited to the
> > passive checks though, as a very similar set of routines is used by the active checks
> > code.
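
For illustration, the "alarm around the write call" idea looks roughly like this
in Python. This is a sketch of the technique only, not the actual change made to
the Nagios C source; write_with_timeout and WriteTimeout are names invented
here. It relies on SIGALRM, so it is Unix-only and has to run in the main thread.

```python
import os
import signal


class WriteTimeout(Exception):
    """Raised by the SIGALRM handler when the write takes too long."""


def _on_alarm(signum, frame):
    raise WriteTimeout()


def write_with_timeout(fd, data, seconds):
    """Write data to fd; return bytes written, or None if the alarm fired first."""
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(seconds)  # deliver SIGALRM if the write is still stuck
    try:
        return os.write(fd, data)
    except WriteTimeout:
        # Give up on this result rather than hang forever.
        return None
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)
```

A caller that gets None back simply drops that check result and exits, which is
the trade-off described above: a few lost passive checks instead of a pile of
stuck processes.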
> > 
> > If you compile nagios with debugging (export "CFLAGS=-g"; ./configure --whatever-options-you-use; make; make install)
> > and then watch the "ps aux" output, you'll notice that there is one really long-running process that takes a
> > fair bit of CPU (which is the good master), and then over time you'll start seeing some other processes that have
> > a start time a fair bit in the past that never die. If you attach to one of these with
> > a debugger (say "cd /wherever/you/compiled/nagios/; gdb base/nagios [pid]" where [pid]
> > is the process ID of one of the processes with a start time more than 1hr ago that is not the
> > master process) and do a "bt" to get a call trace out of it, that would likely help
> > determine where the processes are getting stuck.
> > 
> > If you are having the same problem I was, you will likely see "process_passive_service_checks" and/or
> > "check_for_external_commands" in the call trace
> > (sometimes the stack looks munged so the call stack may not be 100% accurate, leading me
> > to believe that some corruption is what's causing the write to hang, but I wasn't able to
> > figure out what was causing the corruption easily and had to "get things working").
> > 
> > I'd be curious to see if it's the same problem.
> > 
> > > >Jasmine 
> > > I am pretty sure it was not nagios itself, but memory ran out and the server
> > > stood still.
> > > At the moment I have a nagios uptime of:
> > > 
> > > Total Running Time: 0d 6h 6m 15s
> > > And this...
> > > Check Command Output:  Nagios ok: located 1677 processes, status log
> > > updated 170 seconds ago
> > > 
> > > I am pretty sure this is not ok.
> > > 
> > > Any ideas?
> > > 
> > > I will let the server run over the weekend; when it crashes again, I
> > > will give detailed information to the list.
> > > 
> > > 
> > > 
> > > -------------------------------------------------------
> > > This SF.Net email sponsored by: Free pre-built ASP.NET sites including
> > > Data Reports, E-commerce, Portals, and Forums are available now.
> > > Download today and enter to win an XBOX or Visual Studio .NET.
> > > http://aspnet.click-url.com/go/psa00100003ave/direct;at.aspnet_072303_01/01
> > > _______________________________________________
> > > Nagios-users mailing list
> > > Nagios-users at lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/nagios-users
> > > ::: Please include Nagios version, plugin version (-v) and OS
> > > when reporting any issue.
> > > ::: Messages without supporting info will risk being sent to /dev/null
> > > 








