[Nagios-users] Large scale network monitoring limits with nagios

Jason Lancaster jason at teklabs.net
Thu Mar 11 16:29:44 CET 2004


Noah Leaman wrote:

> Using the concept of one service per up/down trap for each network 
> interface, I tested a little by creating a very simple set of Nagios 
> configs with about 8000 PASSIVE service checks and no active 
> service checks. Of course there were no scheduling problems, but the 
> CGIs all slowed to a snail's pace. In my setup (Nagios 1.2, dual G4 
> first-gen Xserve) it takes about 30 seconds to display the 
> Status Summary page.
>
> ... So 9236 services all together, but this is really just a small 
> subset of what I would like to be able to do. The plan is to throw 
> hardware at it to spread out the real work being done (i.e. the active 
> checks).
>
> But with just this setup, a single CGI takes up an entire CPU to run, 
> often for a few minutes at a time... and the plan was to have a 
> good handful of GUI users (5-ish at a time)... it's just about 
> unusable with one GUI user.

I'm using a distributed environment of 4 servers to monitor 6200 
services, so I'm not displaying quite as much as you, but I am close. My 
designated central server, which runs the CGIs, is a dual AMD 2200 with 
3 GB of RAM. I am not using 1.2; I am using 1.1 with a CGI patch 
submitted to the devel list by David Parrish. Viewing the CGIs as an 
admin user who has access to all services/hosts causes no problems for 
me. I have not tested 1.2 because 1.1 works quite well for me and I have 
not wanted any headaches.

The only complaint I have about the CGIs after the patch is that they 
take up between 20% and 50% of a CPU every time someone loads them. If 
too many people in the company are browsing around, things can get 
really slow. I used to cache some of the pages every few minutes, but I 
just didn't like the idea of showing cached data.
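
For anyone who does want to try the page caching I mentioned, here is a 
minimal sketch of the idea, not what I actually ran. It keeps one rendered 
page and only re-renders it once it is older than a chosen age. The names 
TimedCache, ttl_seconds and render are made up for illustration; render 
stands in for whatever produces the page (for example, fetching the output 
of a status CGI).

```python
import time


class TimedCache:
    """Keep one rendered page and refresh it only after ttl_seconds."""

    def __init__(self, ttl_seconds, render, clock=time.time):
        self.ttl_seconds = ttl_seconds
        self.render = render      # hypothetical callable that builds the page
        self.clock = clock        # injectable so the cache can be tested
        self._page = None
        self._stamp = None        # time of the last render, None if never

    def get(self):
        now = self.clock()
        expired = self._stamp is None or now - self._stamp >= self.ttl_seconds
        if expired:
            # Only this path pays the cost of running the expensive render.
            self._page = self.render()
            self._stamp = now
        return self._page
```

The trade-off is the one I disliked: with a TTL of a few minutes, viewers 
can be looking at status that is up to that many minutes old.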

> How to monitor traps for hundreds of network hosts and tens of 
> thousands of different interfaces, each of which could generate up/down 
> traps along with other traps? I tried setting up a single "catch-all" 
> trap service per host, but notification would need to occur when going 
> from an OK to another OK (with a different output). Shouldn't this 
> work with is_volatile on and stalking_options set to o,w,u,c? (Every 
> test I've done to get this working from OK to OK doesn't work... but 
> maybe I missed something.)

Mmmm, this is definitely a users-list question. Personally, I do not use 
the volatile option because we rely entirely on web interfaces (no email 
notifications) to let us know what is going on. I have a "trap server" 
running a "snmptrapd log watcher" program which watches the snmptrapd 
log for events. If a failure on a device triggers a trap with an OID 
that is recognized, it flags the service as critical until someone 
acknowledges it in the web interface.
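
To give a rough idea of the log watcher approach, here is a minimal 
sketch, not my actual program. It turns one log line into a Nagios 
PROCESS_SERVICE_CHECK_RESULT external command with return code 2 
(critical). The log line format ("<host> <trap OID> ..."), the KNOWN_TRAPS 
table, the "SNMP Traps" service name and the command file path are all 
assumptions for illustration; adjust them to your snmptrapd format and 
your nagios.cfg.

```python
import re
import time

# Hypothetical table: trap OID -> (service description, plugin output).
# .1.3.6.1.6.3.1.1.5.3 is the standard linkDown trap OID.
KNOWN_TRAPS = {
    ".1.3.6.1.6.3.1.1.5.3": ("SNMP Traps", "linkDown trap received"),
}

# Assumed log line layout: "<host> <trap OID> <anything else>".
LINE_RE = re.compile(r"^(?P<host>\S+)\s+(?P<oid>\.[0-9.]+)")


def trap_to_command(line, now=None):
    """Return an external command string, or None for unrecognized lines."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    known = KNOWN_TRAPS.get(match.group("oid"))
    if known is None:
        return None
    service, output = known
    timestamp = int(time.time() if now is None else now)
    # Return code 2 marks the passive service as CRITICAL.
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;2;%s\n" % (
        timestamp, match.group("host"), service, output)


def submit(command, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Append the command to the Nagios command file (path is an assumption)."""
    with open(command_file, "a") as handle:
        handle.write(command)
```

A watcher would read new lines from the snmptrapd log, pass each one to 
trap_to_command(), and call submit() on anything that is not None. The 
target service has to exist on that host with passive checks enabled.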

Lots of people have other ways of accomplishing this.

> So the higher-level question here is: am I in over my head in what I 
> want to do with Nagios, or in how I am doing it? After tackling the 
> network monitoring needs, the plan was to then start the server 
> monitoring (around 1000 servers of many platforms).

If I ever migrate to 1.2, I'll be sure to let the list know if I see 
CGI slowness.

Jason

