A different way?

Gaspar, Carson Carson.Gaspar at gs.com
Mon Oct 12 21:28:03 CEST 2009


Apologies for replying to this thread rather late, but I figured I should speak up, as someone who has implemented a distributed design. More apologies for hellish Outlook quoting, which I have attempted to make legible :-( 

-----Original Message-----
From: Andreas Ericsson [mailto:ae at op5.se] 

> On 09/25/2009 01:05 AM, Steven D. Morrey wrote:
>> The checks are already executing on the local machine, so how about a
>> daemon on each machine, the daemon would keep the schedule and
>> execute service checks locally, processing the result and returning
>> the results and the required actions (based on a local policy) to
>> nagios which would then do the actual work of handling notifications
>> etc and so forth. This way nagios could be an auditor, if it doesn't
>> receive a result on time as expected, then it could query the daemon
>> to see whats gone wrong, if that fails then it could initiate a host
>> check, etc.

I see 2 or 3 major differences between your proposal and the current passive schemes:
- Nagios can more easily poke "lost" systems (you can do this now with UNKNOWN results and some clever notification & escalation configs, or possibly with obsessing, but it's far more obscure and convoluted; there's a rough config sketch after this list)
- If I understand you correctly, you're also proposing pushing the flap-detection logic out to the local daemon (and possibly more, though figuring out what else has no off-host dependencies is tricky; dependency checks, for example, would need to stay central)
- It would be possible for Nagios to act as a configuration management system for the monitoring config of the remote nodes, instead of requiring some outboard system
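
For reference, the "obscure and convoluted" version is roughly what freshness checking gives you today. A minimal sketch against stock Nagios 3.x (the host, service, and command names here are hypothetical, "generic-service" is the template from the sample configs, and the 900-second threshold is arbitrary): if no passive result arrives within the freshness window, Nagios runs the stale command, the service goes UNKNOWN, and you notify or escalate on that.

    define service {
        use                     generic-service   ; supplies the other required directives
        host_name               remote-box
        service_description     remote-daemon-heartbeat
        active_checks_enabled   0                 ; results arrive passively
        passive_checks_enabled  1
        check_freshness         1                 ; run check_command when results go stale
        freshness_threshold     900               ; seconds without a passive result
        check_command           stale-unknown
    }

    define command {
        command_name    stale-unknown
        command_line    $USER1$/check_dummy 3 "No recent result from remote daemon"
    }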

> Nagios still needs to retain the ability to execute checks on its
> own, or it won't be able to monitor things like routers and switches.

No, it doesn't. You can monitor those things via plugins that run on worker nodes. This is _especially_ important for things like latency monitoring, where you may want your probe point to be somewhere on the network other than your Nagios server.
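
To make the probe-point argument concrete, here is a minimal sketch of a worker node running a latency check and submitting the result passively. It assumes the stock check_icmp plugin and the NSCA send_nsca client are installed; the hostnames, paths, and service name are hypothetical.

    #!/usr/bin/env python
    # Run a latency probe from this worker node and push the result to the
    # central Nagios server as a passive service check via the NSCA client.
    import subprocess

    NAGIOS_HOST = "nagios.example.com"
    MONITORED_HOST = "router1.example.com"
    SERVICE = "icmp-latency"

    # Run the plugin locally; its exit status is the Nagios state
    # (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
    check = subprocess.run(
        ["/usr/lib/nagios/plugins/check_icmp", "-H", MONITORED_HOST],
        capture_output=True, text=True)
    output = (check.stdout.strip().splitlines() or ["no output"])[0]

    # send_nsca reads "host<TAB>service<TAB>return_code<TAB>output" on stdin.
    line = "%s\t%s\t%d\t%s\n" % (MONITORED_HOST, SERVICE, check.returncode, output)
    subprocess.run(["send_nsca", "-H", NAGIOS_HOST, "-c", "/etc/send_nsca.cfg"],
                   input=line, text=True)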

> The two important savings can be had anyway by simply adding
> more systems, and that doesn't involve modifying the monitored
> systems at all (unless one wants to install a local agent to
> get more detailed monitoring data, of course). Networks that are
> large enough to require multiple Nagios servers are almost
> invariably owned by large corporations which have no qualms at
> all about paying an additional $5,000 for a new server, but
> often have policies and laws regulating what kind of software
> they're allowed to run on their systems.

> I think we'll gain very, very little by moving down this road.
> Should we decide, at some point in the future, that it's a good
> thing to do, I'm sure the Merlin protocol can be (ab)used to
> make such a daemon workable though.

Speaking as someone who actually works at one of those "large corporations" (and has worked at several others): you're smoking crack. We care deeply about bad scaling, and we are not willing to buy 100 servers (not an exaggeration for 2.x; probably more like 20-40 servers for 3.x) to work around bad code design. If I hadn't written a passive check framework, we would never have been able to deploy Nagios.

> Communication has overhead. DNX doesn't scale up linearly with the
> number of poller hosts you add, and neither does Merlin. With the
> amount of communication, and the number of servers involved in the
> networks we're talking about here, I'm highly skeptical that this
> approach will work very well at all. Basically, anything that
> involves more than 500 poller nodes will be tricky to maintain
> due to the sheer amount of connections the master system is
> required to maintain one way or another.

We currently have well over 3,000 "poller nodes" per Nagios instance, with multiple instances running on a single server. The communication overhead is trivial compared to the savings. Note that Nagios communicates _zero_ status to my pollers; all communication flows in the other direction. A more entangled design (which this appears to be) would indeed have higher overhead, but it should still be minimal if intelligently designed.

I can't comment on the current architecture of DNX or Merlin, as I haven't looked at them. Perhaps, being more full-featured than a poller control protocol requires, they are chattier and more synchronous, and thus carry more overhead.
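
To illustrate the one-way model: the poller daemon keeps its own schedule, runs checks locally, and pushes results upstream; Nagios never initiates a connection to it. A stripped-down sketch (the check commands, intervals, hostnames, and paths are hypothetical, and it reuses send_nsca as the transport):

    #!/usr/bin/env python
    # Minimal one-way poller: keep a local schedule, run checks, push results.
    # Nagios sends nothing to this daemon; all traffic flows poller -> Nagios.
    import subprocess, time

    MY_HOSTNAME = "worker01.example.com"
    NAGIOS_HOST = "nagios.example.com"
    CHECKS = [  # (service_description, command, interval_seconds)
        ("disk", ["/usr/lib/nagios/plugins/check_disk", "-w", "20%", "-c", "10%"], 300),
        ("load", ["/usr/lib/nagios/plugins/check_load", "-w", "5,4,3", "-c", "10,8,6"], 60),
    ]

    def submit(service, return_code, output):
        # Push one passive result upstream via the NSCA client.
        line = "%s\t%s\t%d\t%s\n" % (MY_HOSTNAME, service, return_code, output)
        subprocess.run(["send_nsca", "-H", NAGIOS_HOST, "-c", "/etc/send_nsca.cfg"],
                       input=line, text=True)

    next_run = dict((service, 0.0) for service, _, _ in CHECKS)
    while True:
        now = time.time()
        for service, command, interval in CHECKS:
            if now >= next_run[service]:
                result = subprocess.run(command, capture_output=True, text=True)
                output = (result.stdout.strip().splitlines() or ["no output"])[0]
                submit(service, result.returncode, output)
                next_run[service] = now + interval
        time.sleep(1)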

-- 
Carson Gaspar
