How many hosts and services are you monitoring with Nagios?

Simone Felici s.felici at mclink.eu
Fri May 18 09:33:28 CEST 2012


Impressive :)
We're monitoring ~2000 hosts and ~10000 services, every 5 minutes.
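
For a sense of scale, these figures can be turned into an average check rate. This is only a rough sketch: it assumes checks are spread evenly over the 5-minute interval and counts each host as one check, and the function name is made up for illustration. The second call uses the figures from the quoted message below.

```python
def checks_per_second(hosts: int, services: int, interval_s: int = 300) -> float:
    """Average checks per second if every host and service is checked once per interval."""
    return (hosts + services) / interval_s

# ~2000 hosts and ~10000 services every 5 minutes (this setup)
small = checks_per_second(2_000, 10_000)
# 18000+ hosts and 100k+ active services every 5 minutes (quoted message)
large = checks_per_second(18_000, 100_000)

print(f"{small:.1f} checks/s")  # 40.0 checks/s
print(f"{large:.1f} checks/s")  # 393.3 checks/s
```

Real schedulers do not spread checks perfectly evenly, so peak rates will be higher than these averages.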
Architecture used: OPSView Community edition, the last free version before the distributed 
setup was made commercial :/
Two central servers (active/standby, DRBD) act as the single point for management and collect all 
passive check results from the slave servers. Performance data is saved into RRD files as well as on an 
external BIG database server. Configuration resides on a clustered MySQL installation (DRBD).
4 slave "datacenter installations" with 2 servers per "datacenter" in active/active load balancing.
Trap handling is supported on all servers, with rule-based logic.
Pros:
- Open Source: at least up to version 3, for our setup. A simple single instance with fewer functions 
is available on version 4 as well.
- Easy to manage: the purpose was to build the monitoring system and then hand its management over to 
people with less technical skills
- distributed setup
- RBAC
Disadvantages:
- no longer Open Source: see above
- The central server suffers on CPU because of the GUI implementation and other background jobs
- Not all Nagios parameters are editable the way we'd like: e.g. the same check cannot be customized 
with different intervals without creating a new one. Think of an HTTP service on servers with 
different loads, where the retries need to be extended on the high-load servers. There is no way to 
do that except by creating separate "HTTP" and "HTTP High Load" services.
There are probably more pros (and disadvantages), but this is not the right place for them.
BTW, I'm looking forward to this solution; it seems interesting!
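
On the last disadvantage above: in plain Nagios object configuration, each service definition carries its own `max_check_attempts` and `retry_interval`, so two hosts can share the same "HTTP" service name with different retry settings. The sketch below only renders such definitions as text. The host names, the `generic-service` template name and the numeric values are assumptions for illustration, and this says nothing about how OPSView generates its configuration.

```python
def render_service(host: str, description: str, check_command: str, **overrides) -> str:
    """Render one Nagios 'define service' block; overrides are extra per-service directives."""
    directives = {
        "use": "generic-service",  # assumed template name
        "host_name": host,
        "service_description": description,
        "check_command": check_command,
    }
    directives.update(overrides)
    body = "\n".join(f"    {key:<24}{value}" for key, value in directives.items())
    return "define service {\n" + body + "\n}\n"

# Same service name on both hosts; only the busy host gets extended retries.
normal = render_service("web01", "HTTP", "check_http")
busy = render_service("web02", "HTTP", "check_http",
                      max_check_attempts=6, retry_interval=2)

print(normal)
print(busy)
```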

Simon

On 17/05/2012 16:43, Max Schubert wrote:
> Hi,
>
> I like it when people periodically post numbers and architecture
> summaries. I am guessing that, with the distributed frameworks out now
> for Nagios, this thread might see bigger numbers than past threads
> have.
>
> With our custom-built distributed Nagios-based monitoring system, we
> are currently monitoring 18000+ hosts every 5 minutes and 100k+ active
> services (plenty of passive services in addition to the actives) every
> 5 mins as well.  We collect performance data from every check as well
> and pass that on to a highly distributed and scalable time-series data
> warehouse another team in our organization has built (which is why we
> have the 5 min interval requirement)
>
> We also do trap ingest using SNMPTT with a few custom mods, but not
> going to include those numbers as they never have required the
> optimizations the polling has required.
>
> This isn't a monolithic instance, we have 6 projects using instances
> of our distributed Nagios-based software, called Racon (soon my
> manager will give our team to package it as open source - so I hear at
> least).  We built it on core Nagios with a custom database layer based
> on a very very early version of Merlin's database abstraction layer
> (thank you Andreas!) - we have a custom client/server network-based
> notification framework in use (we will release that as well) along
> with a custom NEB/perl based client-server framework (also releasable,
> just need time scheduled) for sending and processing performance data
> - the performance and notification frameworks are both horizontally
> scalable and network fault tolerant.
>
> What kinds of numbers of hosts and services are you all monitoring?
> Which add-ons / distributed frameworks are you using?

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




