A different way?

Andreas Ericsson ae at op5.se
Tue Sep 29 17:06:05 CEST 2009


On 09/25/2009 01:05 AM, Steven D. Morrey wrote:
> Hello everyone,
>
> I've decided to take a break for a bit from multi-threading nagios to
> focus on DNX, since that is my day job after all :) While working on
> all of this I had a few thoughts that might make for some good ideas
> if Nagios is ever re-designed again, say for a 4.x branch.
>
> As you know, under nagios, all checks are dispatched by nagios to be
> executed on the local machine at set intervals. Under a distributed
> nagios setup, you have multiple nagios instances running on various
> machines executing checks and passing the results back to a passive
> master controller.
>
> Under DNX, we distribute the load to "worker nodes" which then
> execute the checks and hand the results back to an active master
> controller that then processes the result etc.
>
> Profiling shows that (under DNX at least) two-thirds of our time is
> spent in the reaper processing results, so wouldn't it make more
> sense to flip the process around?
>
> The checks are already executing on the local machine, so how about
> a daemon on each machine? The daemon would keep the schedule and
> execute service checks locally, processing each result and returning
> the results and the required actions (based on a local policy) to
> nagios, which would then do the actual work of handling notifications
> and so forth. This way nagios could be an auditor: if it doesn't
> receive a result on time as expected, it could query the daemon to
> see what's gone wrong, and if that fails it could initiate a host
> check, etc.
>
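
For concreteness, the host-local scheduler being described could look
roughly like the untested sketch below. The check commands, host name
and spool path are made up for illustration; the only "real" piece is
the PROCESS_SERVICE_CHECK_RESULT external-command format that Nagios
already accepts for passive results (in practice the hand-off would
go over something like NSCA rather than a local file):

    import subprocess
    import time

    # Hypothetical local checks: service -> (command line, interval in secs).
    CHECKS = {
        "disk": ("/usr/lib/nagios/plugins/check_disk -w 20% -c 10% /", 300),
        "load": ("/usr/lib/nagios/plugins/check_load -w 5,4,3 -c 10,8,6", 60),
    }

    HOSTNAME = "myhost"                      # name this host has in Nagios
    SPOOL = "/var/spool/localsched/results"  # made-up hand-off point

    def run_check(cmdline):
        """Run one plugin and return (exit code, first line of output)."""
        proc = subprocess.run(cmdline, shell=True,
                              capture_output=True, text=True)
        output = proc.stdout.splitlines()[0] if proc.stdout else ""
        return proc.returncode, output

    def submit(service, code, output):
        """Queue the result in the external-command format Nagios parses."""
        line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
            int(time.time()), HOSTNAME, service, code, output)
        with open(SPOOL, "a") as spool:
            spool.write(line)

    def main():
        next_run = dict.fromkeys(CHECKS, 0)
        while True:
            now = time.time()
            for service, (cmdline, interval) in CHECKS.items():
                if now >= next_run[service]:
                    code, output = run_check(cmdline)
                    submit(service, code, output)
                    next_run[service] = now + interval
            time.sleep(1)

    if __name__ == "__main__":
        main()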

I'm all for it, provided network checks can still be done from afar
and I don't have to fiddle with a lot of configuration to figure out
which ones are which. That's where this all breaks down though.

Nagios still needs to retain the ability to execute checks on its
own, or it won't be able to monitor things like routers and switches.
Since Nagios has to retain that capability, it might as well run the
network checks against other hosts on its own as well. That leaves
the host itself with a few local checks to schedule and run, and the
results of those still have to be fed back up to Nagios. So what are
we saving here again, and what are the costs?

Savings:
 - We can run more checks in a shorter time interval because we're
   spreading the work around on more hosts.
 - There's no network latency involved in *running* the host-local
   checks.
 - We needn't bother with extra hardware to monitor larger networks.

Costs:
 - Each monitored node needs to have a scheduler on board (they
   already have to have either the plugins or a program that can
   understand performance counters or something similar, so that's
   not an additional cost).
 - Each monitored node needs to be checked in two different ways,
   as some of each node's checks have to be done from another host
   for the result to actually be meaningful. This presents a not
   insignificant configuration burden (see the example after this
   list).
 - Each monitored node will have more advanced code running on it,
   meaning more bugs and a higher maintenance burden.
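
To illustrate the configuration burden in the second cost above:
every node ends up with a split like the one below, where the
network-facing check stays active on the master and the host-local
one only accepts results pushed up from the node. Host and service
names are made up, and the generic-service and check_dummy references
assume template and command definitions that aren't shown here:

    # Same box, two kinds of checks (hypothetical example).
    define service {
        use                     generic-service
        host_name               webserver01
        service_description     HTTP
        check_command           check_http     ; still run from the master
    }

    define service {
        use                     generic-service
        host_name               webserver01
        service_description     Disk Usage
        check_command           check_dummy!3  ; freshness fallback only
        active_checks_enabled   0              ; result is pushed by the node
        passive_checks_enabled  1
    }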


The two important savings can be had anyway by simply adding
more systems, and that doesn't involve modifying the monitored
systems at all (unless one wants to install a local agent to
get more detailed monitoring data, of course). Networks that are
large enough to require multiple Nagios servers are almost
invariably owned by large corporations which have no qualms at
all about paying an additional $5,000 for a new server, but
often have policies and laws regulating what kind of software
they're allowed to run on their systems.

I think we'll gain very, very little by moving down this road.
Should we decide, at some point in the future, that it's a good
thing to do, I'm sure the Merlin protocol can be (ab)used to
make such a daemon workable though.


> From a design standpoint this is a bit more work than the current
> setup, but it seems to me that this could allow for much greater
> flexibility and scalability in the long run.
>
> Anyway, I hope this sparks a little debate. I don't want to "come
> in and shake things up", or go around changing everything and
> stepping on toes all the while; it's just that putting the
> responsibility for actually executing the check, and doing so on
> time, onto the computer it needs to execute on makes more sense to
> me. It's not really dramatically different from what we do now:
> it's just adding a scheduler/timer to the existing execution
> framework and adding something to push the original schedule and
> any changes, such as scheduled downtime, to the appropriate
> machines, putting everything else into a semi-passive mode and
> effectively turning each machine to be checked into its own
> "worker node".
>
> Thoughts?
>
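
The "push the schedule and any changes down to the node" half could,
in principle, be as small as a payload like this. The format below is
invented purely for illustration; nothing like it exists in Nagios,
DNX or Merlin today:

    import json

    # Hypothetical update the master could push to a node whenever its
    # schedule or downtime changes.
    schedule_update = {
        "host": "webserver01",
        "checks": [
            {"service": "disk", "interval": 300,
             "command": "check_disk -w 20% -c 10% /"},
            {"service": "load", "interval": 60,
             "command": "check_load -w 5,4,3 -c 10,8,6"},
        ],
        "downtime": [
            # suppress results and notifications between these timestamps
            {"start": 1254240000, "end": 1254247200},
        ],
    }

    print(json.dumps(schedule_update, indent=2))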

Communication has overhead. DNX doesn't scale up linearly with the
number of poller hosts you add, and neither does Merlin. With the
amount of communication, and the number of servers involved in the
networks we're talking about here, I'm highly skeptical that this
approach will work very well at all. Basically, anything that
involves more than 500 poller nodes will be tricky to maintain
due to the sheer number of connections the master system is
required to maintain one way or another.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
