Core 4 Remote Workers

Andreas Ericsson ae at op5.se
Tue Feb 5 17:57:20 CET 2013


On 02/05/2013 12:43 PM, Eric Stanley wrote:
> On 2/2/13 5:52 PM, Andreas Ericsson wrote:
>> On 02/02/2013 03:12 PM, Eric Stanley wrote:
>>> 3. Add a host key to the worker registration to allow workers to specify
>>> the host(s) for which they will handle checks.
>> Not really difficult, although I suspect one will want to use groups
>> instead of specific hosts, and also use the address which the other
>> node is connecting from as the host to monitor (so one can have self-
>> monitoring servers that phone in to Nagios with their results).
> I like the idea of the remote worker checking itself by default, but I
> think we should allow the remote worker to exclude itself from checking
> its host or maybe from checking certain aspects of itself using the
> check type concept you proposed below.
>>> The reason I have steps 1 and 2, instead of combining them is first,
>>> because a generalized solution is more extensible and second, I think
>>> having multiple TCP listeners is a reasonable use case where you have a
>>> multi-homed system, but you may not want to listen on all interfaces.
>>>
>> That can be firewalled away quite trivially, so no need for us to handle
>> that with code that might break (as I suspect it will see little testing).
>
> I've got to believe that most of the effort and testing would be in going
> from 1 listener to 2, and that going from 1 to n is a simple
> generalization of the 1 to 2 case.

The 1 to n case is already handled. That's the whole point behind the I/O
broker.

> Telling the main daemon not to listen
> on certain interfaces provides security in depth.

Except that people won't trust it, and it's ludicrously simple to set
firewall rules for it. But by all means, adding multiple network sockets
really isn't any harder than adding one, so what the hey.
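
To be concrete: going from one listener to n is mostly a loop around the
same setup code, with each resulting fd registered with the I/O broker
exactly like the single listener is today. A minimal sketch (the
open_listener() helper and the idea of a configured address list are
illustration only, not existing code):

  /* Bind one listening socket per configured address/port pair.
   * The returned fd gets handed to the I/O broker, same as now. */
  #include <string.h>
  #include <netdb.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static int open_listener(const char *addr, const char *port)
  {
      struct addrinfo hints, *res;
      int fd;

      memset(&hints, 0, sizeof(hints));
      hints.ai_family = AF_UNSPEC;      /* v4 or v6, whatever addr is */
      hints.ai_socktype = SOCK_STREAM;
      hints.ai_flags = AI_PASSIVE;

      if (getaddrinfo(addr, port, &hints, &res) != 0)
          return -1;

      fd = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
      if (fd < 0 || bind(fd, res->ai_addr, res->ai_addrlen) < 0 ||
          listen(fd, 128) < 0) {
          if (fd >= 0)
              close(fd);
          freeaddrinfo(res);
          return -1;
      }
      freeaddrinfo(res);
      return fd;   /* caller registers this with the I/O broker */
  }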

>>> The host key should be allowed to specify one or more IP addresses, IP
>>> subnets, contiguous IP address ranges, host names and host name
>>> patterns/wildcards (e.g. *.example.com). If multiple workers register
>>> for the same host, some sort of distribution mechanism should be used to
>>> load balance the workers.
>>>
>> Umm... Is this what the remote worker should request? If so, we're doing
>> a pretty major change in Nagios, where a host's address is always just a
>> string that we pass to the plugins, and it won't be long until people
>> start requesting regex matching, subdomain matching and whatnot for it,
>> and we'll have to start resolving hostnames.
>>
>> I'd say just go with hostgroups instead. It's easier, and people will
>> have to do some minor configuring of remote workers anyway, so saying
>> "hostgroups=core-routers" in that config in addition to ip and port
>> to Nagios isn't such a big chore.
> Maybe I wasn't clear. I don't see a change in the way Nagios itself
> performs checks. It is just the worker specifying the systems for which
> it is willing and able to perform checks. If you configure Nagios to
> check host x and no worker registers specifically to check host x,
> Nagios will use workers that have not specified the hosts for which
> they'll perform checks, which, at least for now, defaults to the local
> workers.
> 
> I find the idea of the remote worker using hostgroups to volunteer for
> checks appealing because of its simplicity, but might it not be fragile?
> Assume the members of the hostgroup must be checked using a remote
> worker because of network configuration. If someone removes a host from
> the hostgroup, it will cease to be checked. If someone deletes (or
> renames) the hostgroup, none of the hosts will be checked. If someone
> adds a host to the hostgroup that the remote worker cannot check, it
> will never be checked. You might spend a lot of time trying to figure
> out why your hosts/services aren't being checked in one of these cases.

That's the same problem you get from misconfiguring anything at all,
except for the "add a host the worker can't check" case. There, the check
will run headlong into whatever error the inability to reach the node
causes and raise an alert from that.

Using a resolved address will be even more fragile. What do we do when
a host resolves to multiple IP addresses and some but not all of them
are within the segment the worker is supposed to check?

If we do that, I'd say we either force people to use standard IP addresses
in the address field, or get them to mark their remote-worker checks with
a custom variable which the worker has to request in its registration.

It *will* require additional configuration. There's no way around that.
Us trying to play clever games and guessing which nodes to monitor based
on IP will bite us in the ass. Not least when companies merge and they
have to take over a second network with conflicting IPs all over the
place.
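
Just to make "additional configuration" concrete, something like this is
what I have in mind (every option name and the _REMOTE_WORKER variable
are made up for the sake of illustration; only the general shape is
what's being proposed):

  # worker.cfg on the remote worker (hypothetical format)
  address=nagios.example.com      # where core Nagios listens
  port=5668
  hostgroups=core-routers,stockholm-dc

  # or, alternatively, mark the hosts in the Nagios object config with a
  # custom variable the worker asks for when it registers:
  define host {
      host_name       core-router-01
      address         192.0.2.1
      _REMOTE_WORKER  stockholm-worker
  }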

>>> Using the second criterion of host to determine which worker gets the
>>> check raises the question of the order of precedence for the criteria.
>>> Initially, I think the host should have precedence over plugin, but I
>>> can see implementing an order-of-precedence option in the core
>>> configuration file. This would be more important if additional worker
>>> selection criteria were added.
>>>
>> Object over check type, any day. We may have to add a "check_type" thing
>> to command objects though, so workers can register for only local checks
>> and still have their http checks and whatnot done from remote, where
>> they make more sense. This requires some thinking.
>
> Good thought. Maybe an "internal" check type for things like disk space
> used, CPU usage, process running, etc., and an "external" type for over
> the network checks such as http or ssh availability? Or could
> servicegroups be used for this functionality? (Now I'm wondering whether
> I'm arguing against myself with my concerns about hostgroups above. :-))

Heh :D
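
If we go the command-object route, it'd probably end up looking something
like this (the check_type directive is pure speculation at this point):

  define command {
      command_name    check_local_disk
      command_line    $USER1$/check_disk -w $ARG1$ -c $ARG2$ -p $ARG3$
      check_type      internal   ; run by a worker on the host itself
  }

Workers registering as "local only" would then get handed the internal
checks for their host, while external ones like check_http stay with
whatever worker can reach the service from the right direction.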

>>
>>> The communication between the remote worker and the core process should
>>> be able to be protected by SSL. The remote worker will need a mechanism
>>> to retry the connection in the event the network drops the connection.
>>>
>> Retrying the connection is the easy part. What should it do with the
>> jobs it's running while the upstream connection is dead? More importantly,
>> how should core Nagios react to the checks it's supposed to run when the
>> connection is down? Issuing "check_disk / -w 90 -c 95" or something is
>> a pretty bad idea.
>
> If the upstream connection is dead, the remote worker won't receive any
> new checks to perform, so I don't see that as an issue. It may have
> check results that it cannot return and maybe there should be a
> freshness timeout on such checks, so they can be discarded after a
> while. We may want to slow down our reconnection attempts after the
> first few failed attempts, just so we don't waste resources on the
> remote host.
> 

Ya, well, that's all in the worker. Retrying once every 15-30 seconds
will cause absolutely no strain on the remote system while still being
fast enough that it will seem to be immediate when someone fixes the
broken firewall.
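
The worker-side logic really is just a dumb loop; a sketch, with
connect_upstream() standing in for whatever actually dials core Nagios:

  #include <unistd.h>

  extern int connect_upstream(void);   /* returns fd, or -1 on failure */

  static int reconnect_upstream(void)
  {
      unsigned int delay = 5;   /* seconds between attempts */
      int fd;

      while ((fd = connect_upstream()) < 0) {
          sleep(delay);
          if (delay < 30)
              delay += 5;       /* 5, 10, 15 ... capped at 30 */
      }
      return fd;
  }

Results that can't be delivered in the meantime can be queued with a
freshness timestamp and dropped once they're too old to be useful, as
suggested above.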

> As for the core Nagios process, if the remote worker is disconnected,
> the core process sees the disconnect and unregisters the remote worker.
> In this case, some other worker may get the next check. If that other
> worker cannot reach the host to be checked or no worker can perform the
> check, you'd get a host unreachable or down state. If the other worker
> can reach the host to be checked, all is well except that you've lost some
> of your distributed monitoring capability.

Well, that's the split brain problem, and there's no solution to that.
However, that problem is already present in DNX and mod_gearman, and
people still use those products to great effect.
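
The disconnect handling on the core side is the easy bit; roughly (the
struct and the two helpers are stand-ins for illustration, and the
handler signature assumes the iobroker-style callback):

  #include <unistd.h>

  struct remote_worker;   /* opaque here */
  extern void unregister_worker(struct remote_worker *);
  extern void handle_worker_input(struct remote_worker *, const char *, size_t);

  static int remote_worker_handler(int fd, int events, void *arg)
  {
      struct remote_worker *wp = arg;
      char buf[4096];
      ssize_t len = read(fd, buf, sizeof(buf));

      (void)events;         /* unused in this sketch */

      if (len == 0) {       /* worker hung up; drop it from the pool */
          unregister_worker(wp);
          close(fd);
          return 0;
      }
      if (len > 0)
          handle_worker_input(wp, buf, len);
      /* len < 0: transient error; real code would look at errno */
      return 0;
  }

Checks queued after that point simply go to whichever workers are still
registered, which is exactly the fallback behaviour described above.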

>>
>> Encryption is a must, of course, as the packets will have to contain
>> passwords some of the time. There's a libssh2 available which we should
>> be able to use to set up preshared key authentication with security
>> that even the NSA will approve of.
> I like the idea of libssh2. SSH is simpler than a PKI, both in concept
> and in implementation.
> 

We could also go with unencrypted at first and just make sure it works
with stunnel or some such. It doesn't really matter, as organizations
large enough to require large-scale distributed setups will have
people who can handle encryption just fine.
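
For reference, the stunnel wrapping is only a few lines on each end.
Something along these lines (ports, hostnames and cert path are made up
for the example; assume the cleartext listener sits on 5668):

  ; /etc/stunnel/stunnel.conf on the worker machine
  [nagios-upstream]
  client = yes
  accept = 127.0.0.1:5668
  connect = nagios.example.com:5669

  ; /etc/stunnel/stunnel.conf on the Nagios server
  [remote-workers]
  accept = 5669
  connect = 127.0.0.1:5668
  cert = /etc/stunnel/stunnel.pem

The worker then connects to its local 5668 as if it were talking
cleartext, and stunnel carries it over TLS to the Nagios side.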

In either case, we should definitely have a cleartext option too, for
debugging if nothing else.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
