Distributed Setup of Nagios

Max perldork at webwizarddesign.com
Wed Aug 18 18:21:45 CEST 2010


On Wed, Aug 18, 2010 at 11:07 AM, Kyle O'Donnell <nagios at isprime.org> wrote:
> we have ~ 30000 services and ~3000 hosts
>
> we have 6 pollers (each has a backup) processing checks and forwarding
> back to a central nagios host.
>
> our busiest poller has ~1000 hosts and ~9000 services... avg service check
> interval is 5 minutes, but there are a bunch at 1 and 2 minute intervals.
>
> avg service check latency is less than 1 second
>
> This is ~3yr old hardware too, I suspect we could increase capacity by 50%
> if we move to the new Intel Nehalems

Nice - I appreciate you sharing your numbers. Everyone who writes
distributed code around Nagios adds overhead, so it is nice to see
real numbers as opposed to 'as many as can be done'; we all know how
wildly that varies :) I have spent many, many hours with my
colleagues tuning the 'as many as can be done' numbers.

We have done a distributed variant of Nagios as well. Our
non-distributed pollers (Compaq 380s with 8 GB RAM + RAID 10) poll 2k
host checks (every 10 minutes) and 11k service checks (avg interval 5
minutes), and all checks also send performance data through a NEB
module to our performance data processing tier. With our distributed
code in place, that falls to around 1.5k host checks and 8-9k service
checks per poller.
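
To put those figures in per-second terms, here is a rough
back-of-the-envelope sketch. It is plain arithmetic on the numbers
quoted above, not anything taken from the distributed code, and the
function name is made up for illustration. It assumes checks are
spread evenly across their interval.

```python
def checks_per_second(count, interval_minutes):
    """Average executions per second for `count` checks, each run
    once every `interval_minutes` (assumes an even spread)."""
    return count / (interval_minutes * 60.0)


# Figures quoted above for one non-distributed poller.
host_rate = checks_per_second(2000, 10)     # 2k host checks, every 10 minutes
service_rate = checks_per_second(11000, 5)  # 11k service checks, avg 5 minutes
total_rate = host_rate + service_rate

print(f"host: {host_rate:.1f}/s  service: {service_rate:.1f}/s  "
      f"total: {total_rate:.1f}/s")
```

That works out to roughly 40 check executions per second on one
non-distributed poller.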

Average non-distributed host and service check latency is around 1.2
seconds; distributed is around 2.4 seconds.

Our new hardware consists of Dell R710s with dual 8 core processors
(wow, do those rock). With our distributed code we are getting around
2x those numbers per poller, even with the overhead of the
distribution mechanism in place.

We will be releasing our distributed variant as open source software
in the next month or so. I suspect that our methodology is org
specific enough that it will not work for many places, but for higher
volume polling it might be worthwhile to adopt, and we hope some of
the concepts and methodologies in it will spark ideas in others for
better ways to do distributed Nagios.

We also take the approach of pushing out configs to remote pollers.
We have a redundant UI tier where we stage a configuration. After the
configuration is staged, we have code (a dot release will allow for
manual operator adjustment) that equally distributes checks among
the pollers designated as being available for use. That code then
builds out a common retention.dat file for all pollers along with
objects.pre-cache files for each poller. Those files are pushed out
to each poller and the pollers are restarted (yes, we have thought
through and worked out all the synchronization issues involved).
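
For readers wondering what "equally distribute checks" could look
like, here is a minimal sketch of one possible approach: greedy
least-loaded assignment, keeping each host together with its services.
This is an illustration only, not the code described above. The
function name, the poller names, and the choice to balance by check
count are all assumptions.

```python
import heapq


def distribute_hosts(host_service_counts, pollers):
    """Assign each host (with all of its services) to whichever
    poller currently carries the fewest checks.

    host_service_counts: dict of host name -> number of services
    pollers: list of names of pollers available for use (hypothetical)
    """
    if not pollers:
        raise ValueError("no pollers available")

    # Min-heap of (current check count, poller name).
    heap = [(0, name) for name in pollers]
    heapq.heapify(heap)
    assignment = {name: [] for name in pollers}

    # Place the biggest hosts first so final loads end up closer together.
    ordered = sorted(host_service_counts.items(),
                     key=lambda kv: (-kv[1], kv[0]))
    for host, n_services in ordered:
        load, name = heapq.heappop(heap)
        assignment[name].append(host)
        # +1 accounts for the host check itself.
        heapq.heappush(heap, (load + n_services + 1, name))
    return assignment


example = distribute_hosts(
    {"web1": 10, "web2": 10, "db1": 4, "db2": 4},
    ["poller-a", "poller-b"],
)
```

Balancing by check count ignores check interval and plugin run time,
so a real implementation would probably want to weight by those too.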

Our UI then lets users take the actions the Nagios UI does, and it
knows where to send the commands to affect the real poller instances.
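
As a rough sketch of the routing idea (again, not the actual code):
given a host-to-poller map such as one produced at distribution time,
the UI tier can look up the owning poller and build a standard Nagios
external command line for it. The map, the function names, and the
poller name are hypothetical, and how the line is delivered to the
poller's command file is left out because it is not described here.

```python
import time


def forced_service_check(host, service, when=None):
    """Build a Nagios external command line that forces a service check."""
    when = int(time.time()) if when is None else int(when)
    return f"[{when}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{when}"


def route_command(host_to_poller, host, service, when=None):
    """Return (poller, command line) for the poller that owns `host`."""
    try:
        poller = host_to_poller[host]
    except KeyError:
        raise LookupError(f"no poller assigned for host {host!r}") from None
    return poller, forced_service_check(host, service, when)


poller, line = route_command({"web1": "poller-a"}, "web1", "HTTP",
                             when=1282148505)
```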

Working well so far, and as with all the alternate Nagios UIs, we are
able to make a much more intuitive and flexible UI.

Code should be available in early October.

- Max

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users