A different way?

Andreas Ericsson ae at op5.se
Wed Oct 14 11:04:05 CEST 2009


On 10/12/2009 09:28 PM, Gaspar, Carson wrote:
> Apologies for replying to this thread rather late, but I figured I
> should speak up, as someone who has implemented a distributed design.
> More apologies for hellish Outlook quoting, which I have attempted to
> make legible :-(
>

I just rewrapped everything now. The lines were somewhat in excess of
400 characters. Sorry about that, but I couldn't read the mail while
editing the reply otherwise.

> -----Original Message----- From: Andreas Ericsson [mailto:ae at op5.se]
>
>> On 09/25/2009 01:05 AM, Steven D. Morrey wrote:
>>> The checks are already executing on the local machine, so how
>>> about a daemon on each machine, the daemon would keep the
>>> schedule and execute service checks locally, processing the
>>> result and returning the results and the required actions (based
>>> on a local policy) to nagios which would then do the actual work
>>> of handling notifications etc and so forth. This way nagios could
>>> be an auditor, if it doesn't receive a result on time as
>>> expected, then it could query the daemon to see what's gone wrong,
>>> if that fails then it could initiate a host check, etc.
>
> I see 2 or 3 major differences between your proposal and the current
> passive schemes:
>  - Nagios can more easily poke "lost" systems (you can do this now
>    with UNKNOWN and some clever notification & escalation configs, or
>    possibly with obsessing, but it's far more obscure and convoluted)
>  - If I understand you, you're also proposing pushing the flap
>    detection logic (and possibly more, but determining what else has
>    no off-host dependencies is difficult - dependency checks would
>    need to be central, for example)
>  - It would be possible for Nagios to act as a configuration
>    management system for the monitoring config of the remote nodes,
>    instead of requiring some outboard system
>

I don't know which one of these attributes you see in which solution,
so I really can't comment on them.

>> Nagios still needs to retain the ability to execute checks on its
>> own, or it won't be able to monitor things like routers and
>> switches.
>
> No, it doesn't. You can monitor those things via plug-ins that run on
> worker nodes. This is _especially_ important for things like latency
> monitoring, where you may want your probe point to be a different
> place on the network than your Nagios server.
>

You quoted this out of context. The paragraph just above it is really
the important one. Since each agent-daemon would act as a very small
Nagios daemon, "Nagios" in this sense can be any of the multitudes of
Nagios daemons. The missing paragraph was this, btw:

"I'm all for it, provided network checks can still be done from afar
and I don't have to fiddle with a lot of configuration to figure out
which ones are which. That's where this all breaks down though."

>> The two important savings can be had anyway by simply adding more
>> systems, and that doesn't involve modifying the monitored systems
>> at all (unless one wants to install a local agent to get more
>> detailed monitoring data, of course). Networks that are large enough
>> to require multiple Nagios servers are almost invariably owned by
>> large corporations which have no qualms at all about paying an
>> additional $5.000 for a new server, but often have policies and
>> laws regulating what kind of software they're allowed to run on
>> their systems.
>
>> I think we'll gain very, very little by moving down this road.
>> Should we decide, at some point in the future, that it's a good
>> thing to do, I'm sure the Merlin protocol can be (ab)used to make
>> such a daemon workable though.
>
> Speaking as someone that actually works at one of those "large
> corporations" (and has worked at several others), you're smoking
> crack. We care deeply about bad scaling, and are not willing to buy
> 100 servers (not an exaggeration for 2.x, probably more like 20-40
> servers for 3.x) to fix bad code design. If I hadn't written a
> passive check framework, we would never have been able to deploy
> Nagios.
>

Let's say 30 servers for Nagios 3. That means you're executing about
half a million checks every five minutes, or 100,000 checks per minute
according to the scalability tests we've run.
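
A rough sanity check of those figures, as a minimal sketch. The even
split of the load across the 30 servers is an assumption made purely
for illustration, not something measured, and the function name is
made up for this example:

```python
# Back-of-the-envelope check of the numbers above.
# Assumption: checks are spread evenly over all servers.
SERVERS = 30
CHECKS_PER_INTERVAL = 500_000   # "half a million checks"
INTERVAL_MINUTES = 5            # "every five minutes"


def check_rates(servers, checks_per_interval, interval_minutes):
    """Return (total checks/minute, checks/interval/server, checks/second/server)."""
    total_per_minute = checks_per_interval / interval_minutes
    per_server_per_interval = checks_per_interval / servers
    per_server_per_second = per_server_per_interval / (interval_minutes * 60)
    return total_per_minute, per_server_per_interval, per_server_per_second


total_per_minute, per_server, per_server_per_second = check_rates(
    SERVERS, CHECKS_PER_INTERVAL, INTERVAL_MINUTES
)
print(f"total:      {total_per_minute:,.0f} checks/minute")
print(f"per server: {per_server:,.0f} checks per {INTERVAL_MINUTES} minutes")
print(f"per server: {per_server_per_second:.1f} checks/second")
```

Under that assumption each server handles roughly 16,700 checks per
five-minute interval.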

>> Communication has overhead. DNX doesn't scale up linearly with the
>> number of poller hosts you add, and neither does Merlin. With the
>> amount of communication, and the number of servers involved in the
>> networks we're talking about here, I'm highly skeptical that this
>> approach will work very well at all. Basically, anything that
>> involves more than 500 poller nodes will be tricky to maintain due
>> to the sheer amount of connections the master system is required to
>> maintain one way or another.
>
> We currently have well over 3,000 "poller nodes" per Nagios instance,
> with multiple instances running on a single server. The communication
> overhead is trivial compared to the savings. Note that Nagios
> communicates _zero_ status to my pollers, all communication is in the
> other direction.

That means you can't disable checks or force checks without fiddling
with the proper host then. Or have you solved that too?

> A more entangled design (which this appears to be)
> would, indeed, have higher overhead, but it should still be minimal
> if it were intelligently designed.

It is more complex, certainly, but since the goal is to provide wide
area scalability to huge networks without sacrificing any capabilities,
it has to be that way. One-way communication simply isn't enough for
that.

> I can't comment on the current
> architecture of DNX or Merlin, as I haven't looked at them. Perhaps
> as they are more full featured than would be required for a poller
> control protocol, they are far chattier, and more synchronous, and
> thus have more overhead.
>

Actually, the Merlin protocol normally has *less* overhead than the
way passive checks are handled today. It sends only as many bytes as
are required rather than a fixed-size buffer, and it is a binary
protocol, so it packs more information into less space. There's also
no need to convert to and from strings. Not that that's a great chore,
but it all adds up.
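
To illustrate the size difference between a padded fixed-size buffer
and a length-prefixed binary packet, here is a minimal sketch. It is
NOT the Merlin wire format, nor that of any existing passive check
transport: the 512-byte buffer size, the header layout and all function
names are assumptions invented for this example.

```python
import struct

# Hypothetical fixed packet size; real transports may use another value.
FIXED_BUFFER_SIZE = 512

# Hypothetical compact header: 1-byte status code followed by three
# 2-byte field lengths, in network byte order (7 bytes, no padding).
HEADER = struct.Struct("!BHHH")


def encode_fixed(host, service, status, output):
    """Text fields joined by ';' and zero-padded to a fixed-size buffer."""
    text = f"{host};{service};{status};{output}".encode("utf-8")
    if len(text) > FIXED_BUFFER_SIZE:
        raise ValueError("result does not fit in the fixed buffer")
    return text.ljust(FIXED_BUFFER_SIZE, b"\0")


def encode_compact(host, service, status, output):
    """Binary header plus exactly as many payload bytes as are needed."""
    h, s, o = (field.encode("utf-8") for field in (host, service, output))
    return HEADER.pack(status, len(h), len(s), len(o)) + h + s + o


def decode_compact(packet):
    """Inverse of encode_compact; the status stays an integer throughout."""
    status, hlen, slen, olen = HEADER.unpack_from(packet)
    body = packet[HEADER.size:]
    host = body[:hlen].decode("utf-8")
    service = body[hlen:hlen + slen].decode("utf-8")
    output = body[hlen + slen:hlen + slen + olen].decode("utf-8")
    return host, service, status, output


fixed = encode_fixed("web01", "HTTP", 0, "OK")
compact = encode_compact("web01", "HTTP", 0, "OK")
print(len(fixed), len(compact))
```

For this small result the padded packet is always 512 bytes, while the
compact one is the 7-byte header plus the 11 bytes of actual text.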

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
