Large scale network monitoring limits with nagios

Andy Mayhew amayhew at icewire.com
Thu Mar 11 19:40:11 CET 2004


From my experience, although you are seeing a high CPU load, the real culprit
is disk I/O.  With RRD writing to various files, along with Nagios logging and
such, there is quite a bit of I/O going on even without the CGIs.  The CGIs in
turn get slow while waiting to read the Nagios configs and status file.

My solution for this, under Linux, has been to take advantage of the fact that
memory is cheap and keep the RRD files, status files, and Nagios command files
on RAM disks.  The files that are not transitory (the RRD files) are saved to
disk by a script on a regular basis for backup.
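
To illustrate the periodic backup step, here is a minimal sketch in Python.
It is not the script used in this setup; the function name and both directory
arguments are hypothetical, and it assumes the RRD files can be recognised by
a .rrd extension somewhere under the RAM disk mount point.

```python
import shutil
from pathlib import Path


def backup_rrd_files(ramdisk_dir, backup_dir):
    """Copy every .rrd file under ramdisk_dir into backup_dir.

    The directory layout below ramdisk_dir is preserved. Both paths are
    placeholders chosen by the caller, not values from the original post.
    """
    src_root = Path(ramdisk_dir)
    dst_root = Path(backup_dir)
    copied = []
    for src in sorted(src_root.rglob("*.rrd")):
        dst = dst_root / src.relative_to(src_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        # Copy to a temporary name first, then rename, so an interrupted
        # run never leaves a half-written file under the final name.
        tmp = dst.with_name(dst.name + ".tmp")
        shutil.copy2(src, tmp)
        tmp.replace(dst)
        copied.append(dst)
    return copied
```

Something like this could be run from cron at whatever interval matches the
amount of history you can afford to lose.  A matching restore step at boot
(copying the files back onto the RAM disk before Nagios starts) would also be
needed and is not shown here.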

So, for my distributed environment, I have ~748 servers, ~8400 passive service
checks, an additional ~600 local active service checks, and ~6700 RRD metrics
being stored.  The central server is a dual P-III 850 with 2 GB of RAM.  The
average page refresh time we've seen is under 5 seconds for the CGIs with
10-15 active users at any time.

Some other notes on my environment:  98% of the passive service checks
received are via NSCA, with the rest being SNMP traps.  I aggregate NSCA
traffic on each of the distributed Nagios servers, which dramatically cuts
down on the setup and teardown of TCP sockets.  Moving all named pipes used by
Nagios onto the RAM disk also made dramatic improvements in performance.  Of
the average 12 service checks on each host, 8 are system related (CPU util,
disk space, mem, network traffic, etc).  All of the performance/utilization
data from these checks is stored in RRD.  I do this via an OCSP command which
parses the service check return line and runs the appropriate rrdtool command.
The hosts monitored are a combination of Solaris, Linux, network devices
(Cisco, Alteon, Foundry), and Windows hosts.
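
As an illustration of the OCSP-to-RRD step, the following Python sketch
parses a check's output line and builds an `rrdtool update` command.  It is
not the actual OCSP command used here.  The output format (name=value pairs
in the text), the one-RRD-file-per-service layout, and all function names are
assumptions made for the example; only the `rrdtool update FILE --template
NAMES N:VALUES` form is standard rrdtool usage.

```python
import re
import subprocess
from pathlib import Path

# Assumed plugin output format: "CPU OK - user=12.5 system=3.1 idle=84.4"
METRIC_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)=(-?\d+(?:\.\d+)?)")


def parse_metrics(output):
    """Return (name, value) string pairs found in a check output line."""
    return METRIC_RE.findall(output)


def build_rrd_update(rrd_dir, host, service, output):
    """Build the rrdtool argument list, or None if no metrics were found."""
    metrics = parse_metrics(output)
    if not metrics:
        return None
    # Hypothetical layout: <rrd_dir>/<host>/<service>.rrd
    safe_service = re.sub(r"[^A-Za-z0-9_.-]", "_", service)
    rrd_file = Path(rrd_dir) / host / (safe_service + ".rrd")
    names = ":".join(name for name, _ in metrics)
    values = ":".join(value for _, value in metrics)
    # "N" tells rrdtool to timestamp the update with the current time.
    return ["rrdtool", "update", str(rrd_file),
            "--template", names, "N:" + values]


def store_metrics(rrd_dir, host, service, output):
    """Run rrdtool for one check result; does nothing without metrics."""
    cmd = build_rrd_update(rrd_dir, host, service, output)
    if cmd is not None:
        subprocess.run(cmd, check=True)
```

A wrapper script would call store_metrics() with the host name, service
description, and output that Nagios passes to the OCSP command as macros.
The RRD files must already exist with data source names matching the metric
names; creating them is outside this sketch.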

This is just a quick overview of how we handled performance issues.  If
anyone is interested in a more detailed HowTo-type document, I'll see what I
can whip up.  Also, I don't do pie charts, but I think my graphs are pretty
okay in comparison to the uptimesoftware folks ;)

--Andy

On Thu, Mar 11, 2004 at 04:52:00AM -0800, Noah Leaman wrote:
> Hope it's OK to cross-post to both groups on this matter...
> 
> Using the concept of one service per up/down trap for each network 
> interface, I tested a little by creating a very simple set of Nagios 
> configs, but with about 8000 PASSIVE service checks and no active 
> service checks. Of course there was no problem in terms of scheduling 
> issues, but the CGIs all slowed to a snail's pace. In my setup (Nagios 
> 1.2, dual G4 first-gen Xserve) it takes about 30 secs to display the 
> Status Summary page.
> 
> Of course that config setup isn't the actual production plan...
> 
> I enabled the closer to real-world configs:
> 
> 552 check_traffic (2 snmpgets running every 10 minutes per service 
> check storing to an RRD)
> 295 check_ping (number of locally monitored hosts)
> 8389 check_dummy (mostly the up/down Trap and about 100 are passive 
> services coming from 2 other distributed nagios servers doing pings and 
> check_traffics)
> 
> ... So 9236 services all together, but this is really just a small 
> subset of what I would like to be able to do. The plan is to throw 
> hardware at it to spread out the real work being done (i.e. the active 
> checks).
> 
> But with just this setup, a single CGI takes up an entire CPU while it 
> runs, often for a few minutes at a time... and the plan was to have a 
> good handful of GUI users (5 ish at a time)... it's just about unusable 
> with one GUI user.
> 
> How do I monitor traps for hundreds of network hosts and tens of 
> thousands of different interfaces, each of which could generate up/down 
> traps along with other traps? I tried setting up a single "catch-all" 
> trap service per host, but notification would need to occur when going 
> from an OK to another OK (with a different output). Shouldn't this 
> work with is_volatile on and stalking_options set to o,w,u,c? (Every 
> test I've done to get this working from OK to OK has failed... but 
> maybe I missed something.)
> 
> So the higher-level question here is: am I in over my head in what I 
> want to do, or how I want to do it, with Nagios? After tackling the 
> network monitoring needs, the plan was to then start the server 
> monitoring (around 1000 servers of many platforms).
> 
> Any helpful guidance?
> 
> -- 
> Noah
> 
> 
> On Wednesday, March 10, 2004, at 06:51  PM, Noah Leaman wrote:
> 
> >I have over 70,000 interfaces/ports (just the up/up ones) for which I 
> >could receive linkDown and linkUp traps. And this is just a 
> >sampling of hosts on our network to pilot Nagios to see if it can do 
> >what we want. Doesn't it seem a little crazy to have to deal with that 
> >many services even if they are passive? And this is just linkDown and 
> >linkUp. What about all other possible traps that could be received?
> >
> >-- 
> >Noah
> >
> >
> >On Friday, March 5, 2004, at 01:15  AM, Jim Mozley wrote:
> >
> >>Noah Leaman wrote:
> >>
> >>>How do you all address the issue of trap monitoring when you want 
> >>>notifications for them?
> >>
> >>I have done something similar with interfaces, the only way I know is 
> >>to define each interface as a service. I realise this is potentially 
> >>a lot of services. We do this on core network device interfaces, but 
> >>only define services for interfaces that are in use. This is an 
> >>automated process so as interfaces are activated/deactivated they are 
> >>added or removed from the Nagios configuration files. As the only 
> >>alerts are passive ones for these services, it isn't as though one is 
> >>introducing something like a vast increase in active checks.
> >>
> >>HTH,
> >>
> >>Jim Mozley
> >>
> >>
> >>-------------------------------------------------------
> >>This SF.Net email is sponsored by: IBM Linux Tutorials
> >>Free Linux tutorial presented by Daniel Robbins, President and CEO of
> >>GenToo technologies. Learn everything from fundamentals to system
> >>administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
> >>_______________________________________________
> >>Nagios-users mailing list
> >>Nagios-users at lists.sourceforge.net
> >>https://lists.sourceforge.net/lists/listinfo/nagios-users
> >>::: Please include Nagios version, plugin version (-v) and OS when 
> >>reporting any issue. ::: Messages without supporting info will risk 
> >>being sent to /dev/null
> >>
> >
>  
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel

