nagios and apan cause server to crash... [FIX]

jeff vier jeff.vier at tradingtechnologies.com
Tue Nov 18 16:11:17 CET 2003


Just wanted to post this to the lists as well...

On Mon, 2003-11-17 at 20:44, Evan Weston wrote:
> Thanks heaps for this. As suggested, I upgraded the system to Fedora Core with the latest glibc from up2date today, and all seems to be running fine so far. I guess time will tell whether it's fixed permanently or not :)
> 
> Evan
> 
> -----Original Message-----
> From: jeff vier [mailto:jeff.vier at tradingtechnologies.com]
> Sent: Tuesday, 18 November 2003 2:15 AM
> To: Evan Weston
> Subject: RE: [Apan-users] Re: [Nagios-users] nagios and apan cause
> server to crash...
> 
> 
> I should follow up to the list about this, but I determined it wasn't
> apan's "fault" per se, rather it was a kernel problem being aggravated
> by apan.  So, while my watcher did "work" (in that it did what I wanted
> - killed apan.sh PIDs) it would only do so for a few seconds until the
> box crash overwhelmed it anyway.
> 
> My recommendation is to upgrade to the Fedora kernel
> (kernel-smp-2.4.22-1.2088.nptl) and the latest glibc (glibc-2.3.2-27.9)
> (and requisite dependencies, of course).  Be careful with the glibc
> update - use the up2date utility or you could miss a dependency that
> will cause major system problems (don't depend on the raw rpm
> utility...this is a known problem when updating from base RH9!)
> 
> After updating (and rebooting with) the new kernel, that machine has
> been up and happy for 2 weeks and running strong (whereas before we'd be
> lucky to get 4 days!)
> 
> Hope this helps - if you have any questions, please let me know.
> 
> On Sun, 2003-11-16 at 17:15, Evan Weston wrote:
> > Hi Jeff,
> > 
> > Could I please get a copy of the watcher daemon you wrote to keep apan under control?
> > 
> > Thanks
> > 
> > Evan
> > 
> > -----Original Message-----
> > From: jeff vier [mailto:jeff.vier at tradingtechnologies.com]
> > Sent: Wednesday, 15 October 2003 1:46 AM
> > To: Fredrik Wänglund
> > Cc: nagios-users; Apan-users List
> > Subject: Re: [Apan-users] Re: [Nagios-users] nagios and apan cause
> > server to crash...
> > 
> > 
> > On Tue, 2003-10-14 at 01:21, Fredrik Wänglund wrote:
> > > What platform/version are you running on?
> > 
> > RH9, dual 1.4GHz, 1G RAM
> > 139 Hosts, 838 services, about 300 of which are apan-based
> > 
> > > I'm running without any problem under RedHat 8.0 on a PIII 1400MHz with 
> > > 170 hosts, 200 apan-services and 300 'normal' services.
> > > My system-load stays between 1 and 2, CPU is mainly >80% idle
> > 
> > Until the "apan problem", load hangs out at around .3 to .9 (depending
> > on what it's doing - it's only blipped over 1.0 twice in 24 hours) with
> > an idle solidly at 86% (sar shows a min of 85.23% idle and a max of
> > 86.96% in the last 24 hours).  So the box is quite clean and happy.
> > 
> > Like I said before, when the apan freak-out comes around, though, it
> > shoots WAY up.
> > 
> > Notably, I wrote a little watcher daemon to check for rogue apan
> > processes.  If anyone wants it, email me.
> > 
> > > jeff vier wrote:
> > > 
> > > >I'm having the same problem here.
> > > >
> > > >I have been capturing dumps of the top command, pulling only active
> > > >processes.  It looks like something causes an instance of apan.sh to
> > > >hang, and then they just start piling up (fast).
> > > >
> > > >The load is usually under 1.0 (sometimes jumping up to 1.xx - no big
> > > >deal).  When it died, my load was over 80 (yes, eighty) with at least
> > > >46 *active* apan.sh processes.  I'm not sure of the actual count: the
> > > >top dump only shows 62 lines of processes but reported 73 running, so
> > > >more of those were likely apan.sh, and there was an unknown number of
> > > >inactive apan.sh processes sitting and waiting.  There were also 17
> > > >zombies (unknown parent, alas), with 99% CPU usage on CPU0 and 100% on
> > > >CPU1.  Yikes.  This jump happened over 16 minutes, at which point my
> > > >crons no longer ran, so who knows how badly it kept piling up.
> > > >
> > > >apan.debug log file doesn't show anything abnormal (whee.)
> > > >
> > > >I'm going to have to write a watcher to manually kill the hanging
> > > >apan.sh procs, which I don't want to do for fear of inadvertently
> > > >killing valid processes, but I am quite sick of having to go over to the
> > > >colo to poke the power button once a week (only been in production 3
> > > >weeks - 4 crashes so far).
> > > >
> > > >I'm going to increase my level of manual debugging, too, of processes,
> > > >etc.  I'll post any new insight.
> > > >
> > > >--jeff
> > > >
> > > >On Wed, 2003-10-08 at 10:31, Matthew Wilson wrote:
> > > >  
> > > >
> > > >>UPDATE: I have checked and my nagios installation does not have ePN compiled
> > > >>in.  So this is not the cause.  I would greatly appreciate any suggestions
> > > >> on how to prevent/cure this problem.
> > > >>
> > > >>    
> > > >>
> > > >>>Thanks
> > > >>>Matthew Wilson.
> > > >>>      
> > > >>>
> > > >>>>Matthew Wilson wrote:
> > > >>>>
> > > >>>>        
> > > >>>>
> > > >>>>>Hi guys,
> > > >>>>>I have read in the list archives in the last couple of months a few
> > > >>>>>threads about nagios and apan chewing up memory.  I have tried a few
> > > >>>>>of the solutions posted but still have no joy.
> > > >>>>>          
> > > >>>>>
> > > >>
> > > >>-------------------------------------------------------
> > > >>This SF.net email is sponsored by: SF.net Giveback Program.
> > > >>SourceForge.net hosts over 70,000 Open Source Projects.
> > > >>See the people who have HELPED US provide better services:
> > > >>Click here: http://sourceforge.net/supporters.php
> > > >>_______________________________________________
> > > >>Nagios-users mailing list
> > > >>Nagios-users at lists.sourceforge.net
> > > >>https://lists.sourceforge.net/lists/listinfo/nagios-users
> > > >>::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
> > > >>::: Messages without supporting info will risk being sent to /dev/null
> > > >>    
> > > >>
> > > >
> > > >
> > > >
> > > >_______________________________________________
> > > >Apan-users mailing list
> > > >Apan-users at lists.sourceforge.net
> > > >https://lists.sourceforge.net/lists/listinfo/apan-users
> > > >  
> > > >
> > > 
> > 
> > 
> > 
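
The watcher daemon mentioned in the thread is not reproduced here. As a rough illustration of the idea only (find apan.sh processes that have been running far longer than a check should take, then signal them), here is a minimal Python sketch. Everything in it is an assumption rather than the original script: the five-minute threshold, matching on the string "apan.sh" in the command line, the use of SIGTERM, and the function names are all hypothetical. Matching on a substring carries exactly the risk raised in the thread of killing a valid process, so the threshold should sit well above the longest legitimate run.

```python
import os
import signal
import subprocess

MAX_AGE_SECONDS = 300      # assumed threshold; tune to your longest valid check
PROCESS_NAME = "apan.sh"   # substring looked for in each command line


def parse_etime(etime):
    """Convert ps elapsed time '[[dd-]hh:]mm:ss' into seconds."""
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:          # pad a missing hours field with zero
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def find_stale(ps_output, name=PROCESS_NAME, max_age=MAX_AGE_SECONDS):
    """Return PIDs whose command line contains name and whose age exceeds max_age."""
    stale = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)   # pid, etime, full command line
        if len(fields) < 3:
            continue
        pid, etime, args = fields
        if name in args and parse_etime(etime) > max_age:
            stale.append(int(pid))
    return stale


def kill_stale():
    """List all processes with ps and send SIGTERM to the stale matches."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etime=", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    for pid in find_stale(result.stdout):
        if pid == os.getpid():         # never signal the watcher itself
            continue
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass                       # process exited between ps and kill
```

Calling kill_stale() once a minute from cron would approximate a watcher, though as noted above this only treats the symptom: on the affected box the pile-up outran the watcher within seconds, and the kernel and glibc upgrade was the actual fix.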



_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




