Nagios centralized server BUG???

Steven L. Kohrs steve at dtnspeed.net
Wed Jan 15 15:53:50 CET 2003


What's the status on this problem?  I'm experiencing the same thing on
RH 6.2.  I've read the tuning docs, but I can't find anything about
handling this.  Is it a parallelization issue?  I set
max_concurrent_checks=100.  After restarting Nagios 50 minutes ago, I've
got 208 processes.  I had a similar problem on a remote host, but
setting max_concurrent_checks=5 took care of it.
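
For anyone who wants to track the process count over time rather than eyeball it, here is a minimal Python sketch. It assumes a Linux procps `ps` that accepts `-eo comm` and that the daemon's command name is `nagios`; the function names are made up for illustration and are not part of Nagios.

```python
import subprocess
from collections import Counter


def count_by_command(ps_output: str) -> Counter:
    """Count processes per command name from `ps -eo comm` style output."""
    counts = Counter()
    for line in ps_output.splitlines()[1:]:  # skip the header row
        name = line.strip()
        if name:
            counts[name] += 1
    return counts


def nagios_process_count() -> int:
    """Run ps and return how many processes are named 'nagios' (assumed name)."""
    result = subprocess.run(
        ["ps", "-eo", "comm"], capture_output=True, text=True, check=True
    )
    return count_by_command(result.stdout)["nagios"]
```

Calling `nagios_process_count()` from cron every few minutes and logging the result would show whether the count keeps climbing or levels off after a restart.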

I've got one central server performing 30 active checks and accepting
about 900 passive checks from 30 remote servers via NSCA.  
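
For reference, each passive result a remote server hands to send_nsca is one tab-separated line: host name, service description, return code, plugin output. The Python sketch below builds such a line; the function name and the sample host are hypothetical, and it covers only the line format, not the network side.

```python
def nsca_line(host: str, service: str, return_code: int, output: str) -> str:
    """Build one service check result line in the format send_nsca reads on stdin."""
    # Plugin return codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0, 1, 2 or 3")
    for field in (host, service, output):
        # A tab or newline inside a field would corrupt the record
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return f"{host}\t{service}\t{return_code}\t{output}\n"
```

With about 900 passive checks arriving, every such line becomes one external command for the central Nagios process to handle.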

I believe the problem Gerald described below is the result of too
many processes running.  I can hardly run a 'ps' command when
Nagios runs away.  With the box that starved, how can it update a
service status?

Thanks,

Steve Kohrs

From: Burnson, Richard <rburnson at cp...>
RE: Nagios centralized server BUG???  
2003-01-03 15:10
 
I tried running Nagios on RH 8.0 as well.  (Part of my plan to set up a
distributed system; see previous e-mail.)  I left the existing server
running RH 7.2, and on a duplicate machine with the exact same hardware
(dual 1 GHz processors and 1 GB RAM, in the same model server) I
installed RH 8.0.  I installed Nagios and moved the configs over from the
7.2 box.  While the 7.2 box has run w/o a hitch for 1.5 years, the RH 8.0
box would run out of memory and the kernel would kill the nagios
process(es).  So I blew away RH 8.0 and installed 7.2 on the box, and
was able to run the Nagios setup the same as the original.  Not sure
what gives, but it seems like 8.0 has some bugs in it that Red Hat still
needs to work out.  So my recommendation is to run it on 7.2 or 7.3
until 8.x is stable.
  
Richard
  
-----Original Message-----
From: Gerald Wichmann [mailto:gwichman at za...] 
Sent: Friday, January 03, 2003 4:46 PM
To: Nagios (E-mail)
Subject: [Nagios-users] Nagios centralized server BUG???
  
Well, I'm about to give up and install this central server on another
box. I'm running it on RH8 and it's driving me nuts. I have 1 central server
accepting only passive service checks, plus 2 distributed servers which
submit passive checks to the centralized server's nsca daemon. Watching
/var/log/messages I can clearly see all the EXTERNAL COMMANDS being
submitted exactly as I'd expect them to. All services are reporting and
showing up OK, including Ping. Yet when I look at "host detail" or
"service detail" something doesn't mesh. Either there's a bug in Nagios
or I seriously have something wacky going on here.
  
Despite the fact that all services report OK, under "host detail" I
have a variety of servers showing up as RED/DOWN. Last Check is recent.
Status Information is always "CRITICAL - Plugin timed out after 10
seconds". Status is either UNREACHABLE (most of them) or DOWN (1 of
them).
  
OK, so I click on "service detail". Over there all services report "OK"
and green. For some odd reason the Ping services are old in the Last
Check column, like 7 hours, even though I can watch /var/log/messages
and see that I'm receiving PING updates as OK regularly. The other
services mostly have recent updates, but a lot of them are
1, 2, and even 3 hours out of date. Why is my service detail page so out
of date?
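
One way to put numbers on the gap between what the log shows and what the status page shows is to pull the most recent submission time per service out of the log. The Python sketch below assumes lines in the Nagios log style, `[epoch] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;host;service;code;output`; syslog lines in /var/log/messages carry a different timestamp prefix, so the pattern would need adapting there. The function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed line format: "[epoch] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;..."
LINE_RE = re.compile(
    r"^\[(\d+)\] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;"
    r"([^;]+);([^;]+);(\d+);(.*)$"
)


def last_submissions(lines: Iterable[str]) -> Dict[Tuple[str, str], int]:
    """Map (host, service) to the newest submission timestamp seen in the log."""
    latest: Dict[Tuple[str, str], int] = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a passive service check result
        timestamp = int(match.group(1))
        key = (match.group(2), match.group(3))
        if timestamp > latest.get(key, 0):
            latest[key] = timestamp
    return latest


def stale(latest: Dict[Tuple[str, str], int], now: int, max_age: int) -> List[Tuple[str, str]]:
    """Return the (host, service) pairs whose newest submission is older than max_age seconds."""
    return sorted(key for key, ts in latest.items() if now - ts > max_age)
```

If a service shows a fresh timestamp here but a Last Check that is hours old on the status page, the results are reaching the log but not being applied to the service status.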
  
Someone pointed out that I may have multiple Nagios servers running on
the machine, and well, yes, that's partially true. Initially when I start
Nagios it spawns one nagios -d process, but soon they start to multiply.
Long term I have seen them climb up to 4000, which seems excessive to me.
As far as I can tell they don't reduce in number, nor do they seem to go
much higher than 4000. We're running NetSaint in a much larger
distributed environment here, checking hundreds and hundreds of services,
and it also spawns multiple netsaint processes, but not as many; it seems
to top out usually around 500. So as far as I can tell this behavior of
multiple processes is normal.
  
So what the hell is going on here? Does anyone out there run a
distributed environment with a centralized server?
  
Gerald Wichmann
Senior Systems Development Engineer
Zantaz, Inc.
925.598.3099 (w)




