Core 4 Remote Workers

Eric Stanley estanley at nagios.com
Tue Feb 5 12:43:10 CET 2013


On 2/2/13 5:52 PM, Andreas Ericsson wrote:
> On 02/02/2013 03:12 PM, Eric Stanley wrote:
>> 3. Add a host key to the worker registration to allow workers to specify
>> the host(s) for which it will handle checks.
> Not really difficult, although I suspect one will want to use groups
> instead of specific hosts, and also use the address which the other
> node is connecting from as the host to monitor (so one can have self-
> monitoring servers that phone in to Nagios with their results).
I like the idea of the remote worker checking itself by default, but I
think we should allow the remote worker to opt out of checking its own
host, or of checking certain aspects of itself, using the check type
concept you proposed below.
>> The reason I have steps 1 and 2, instead of combining them is first,
>> because a generalized solution is more extensible and second, I think
>> having multiple TCP listeners is a reasonable use case where you have a
>> multi-homed system, but you may not want to listen on all interfaces.
>>
> That can be firewalled away quite trivially, so no need for us to handle
> that with code that might break (as I suspect it will see little testing).
I would expect most of the effort and testing to go into moving from
1 listener to 2; going from 1 to n is a simple generalization of the
1 to 2 case. Telling the main daemon not to listen on certain
interfaces also provides defense in depth.
>> The host key should be allowed to specify one or more IP addresses, IP
>> subnets, contiguous IP address ranges, host names and host name
>> patterns/wildcards (e.g. *.example.com). If multiple workers register
>> for the same host, some sort of distribution mechanism should be used to
>> load balance the workers.
>>
> Umm... Is this what the remote worker should request? If so, we're doing
> a pretty major change in Nagios where a host's address is always just a
> string that we pass to the plugins, and it won't be long until people
> start requesting regex matching, subdomain matching and whatnot for it,
> and we'll have to start resolving hostnames.
>
> I'd say just go with hostgroups instead. It's easier, and people will
> have to do some minor configuring of remote workers anyway, so saying
> "hostgroups=core-routers" in that config in addition to ip and port
> to Nagios isn't such a big chore.
Maybe I wasn't clear. I don't see a change in the way Nagios itself
performs checks. The worker simply specifies the systems for which
it is willing and able to perform checks. If you configure Nagios to
check host x and no worker registers specifically to check host x,
Nagios will use workers that have not specified the hosts for which
they'll perform checks, which, at least for now, means the local
workers.
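
To make the matching and fallback concrete, here is a minimal sketch in
Python. It is not Nagios code: spec_matches, WorkerPool and the
round-robin distribution are hypothetical names and choices made only
for illustration. It assumes a host key is a list of strings, each one
an IP address, a CIDR subnet, a "first-last" address range, a host name
or a shell-style wildcard.

```python
import fnmatch
import ipaddress


def _as_ip(text):
    """Parse text as an IP address, or return None if it is not one."""
    try:
        return ipaddress.ip_address(text)
    except ValueError:
        return None


def spec_matches(spec, host_name, address):
    """Return True if one host-key entry covers the given host."""
    ip = _as_ip(address)
    if ip is not None:
        # Contiguous range written as "first-last" (assumed syntax).
        if "-" in spec:
            first, _, last = spec.partition("-")
            lo, hi = _as_ip(first.strip()), _as_ip(last.strip())
            if lo is not None and hi is not None:
                if lo.version == hi.version == ip.version:
                    return lo <= ip <= hi
        # Single address or CIDR subnet.
        try:
            net = ipaddress.ip_network(spec, strict=False)
            if net.version == ip.version and ip in net:
                return True
        except ValueError:
            pass
    # Host name or wildcard such as *.example.com, compared as strings.
    pattern = spec.lower()
    return (fnmatch.fnmatchcase(host_name.lower(), pattern)
            or fnmatch.fnmatchcase(address.lower(), pattern))


class WorkerPool:
    """Hypothetical registry: worker name -> host-key entries."""

    def __init__(self):
        self.workers = {}   # an empty list means "no host key given"
        self._turn = 0

    def register(self, name, specs=()):
        self.workers[name] = list(specs)

    def pick(self, host_name, address):
        """Prefer workers that registered for the host, else generic ones."""
        specific = [w for w, specs in self.workers.items()
                    if any(spec_matches(s, host_name, address) for s in specs)]
        generic = [w for w, specs in self.workers.items() if not specs]
        candidates = specific or generic
        if not candidates:
            return None
        # Plain round-robin stands in for "some sort of distribution".
        self._turn += 1
        return candidates[self._turn % len(candidates)]
```

With a local worker registered without a host key and two remote
workers registered for 10.1.0.0/24 and *.dmz.example.com, a host in
both sets alternates between the two remote workers, and any other
host falls back to the local worker.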

I find the idea of the remote worker using hostgroups to volunteer for 
checks appealing because of its simplicity, but might it not be fragile? 
Assume the members of the hostgroup must be checked using a remote 
worker because of network configuration. If someone removes a host from 
the hostgroup, it will cease to be checked. If someone deletes (or 
renames) the hostgroup, none of the hosts will be checked. If someone 
adds a host to the hostgroup that the remote worker cannot check, it 
will never be checked. You might spend a lot of time trying to figure 
out why your hosts/services aren't being checked in one of these cases.
>> Using the second criterion of host to determine which worker gets the
>> check raises the question of the order of precedence for the criteria.
>> Initially, I think the host should have precedence over plugin, but I
>> can see implementing an order of precedence option in the core
>> configuration file. This would be more important if additional worker
>> selection criteria were added.
>>
> Object over check type, any day. We may have to add a "check_type" thing
> to command objects though, so workers can register for only local checks
> and still have their http checks and whatnot done from remote, where
> they make more sense. This requires some thinking.
Good thought. Maybe an "internal" check type for things like disk space
used, CPU usage, process running, etc., and an "external" type for
over-the-network checks such as http or ssh availability? Or could
servicegroups be used for this functionality? (Now I'm wondering whether
I'm arguing against myself with my concerns about hostgroups above. :-))
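
One way to picture "object over check type" with a configurable order
is the sketch below, again in Python and purely illustrative. The
Worker class, the select function and the precedence tuple are
hypothetical; only the "internal" and "external" labels come from the
discussion above. A worker with an empty set for a criterion is
assumed to accept anything for that criterion.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    hosts: set = field(default_factory=set)        # empty = any host
    check_types: set = field(default_factory=set)  # empty = any check type


def select(workers, host, check_type, precedence=("host", "check_type")):
    """Pick the eligible worker whose explicit registrations rank highest."""

    def score(worker):
        explicit = {
            "host": host in worker.hosts,
            "check_type": check_type in worker.check_types,
        }
        # Tuples compare left to right, so earlier criteria win.
        return tuple(explicit[criterion] for criterion in precedence)

    eligible = [
        w for w in workers
        if (not w.hosts or host in w.hosts)
        and (not w.check_types or check_type in w.check_types)
    ]
    return max(eligible, key=score, default=None)


workers = [
    Worker("local"),
    Worker("edge", hosts={"web1"}),
    Worker("http-probe", check_types={"external"}),
]
```

With the default order, an "external" check of web1 goes to "edge"
because it registered for the host. Swapping the order to
("check_type", "host") sends the same check to "http-probe" instead.
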
>
>> The communication between the remote worker and the core process should
>> be able to be protected by SSL. The remote worker will need a mechanism
>> to retry the connection in the event the network drops the connection.
>>
> Retrying the connection is the easy part. What should it do with the
> jobs it's running while the upstream connection is dead? More importantly,
> how should core Nagios react to the checks it's supposed to run when the
> connection is down? Issuing "check_disk / -w 90 -c 95" or something is
> a pretty bad idea.
If the upstream connection is dead, the remote worker won't receive any 
new checks to perform, so I don't see that as an issue. It may have 
check results that it cannot return and maybe there should be a 
freshness timeout on such checks, so they can be discarded after a 
while. We may want to slow down our reconnection attempts after the 
first few failed attempts, just so we don't waste resources on the 
remote host.
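
As a rough illustration of both ideas, the Python sketch below shows a
reconnect schedule that stays quick for the first few attempts and then
backs off, and a buffer that discards results older than a freshness
timeout. All names and default values (reconnect_delays, ResultBuffer,
three fast attempts, a 60 second cap, a 300 second maximum age) are
assumptions made for the example, not proposed settings.

```python
import time


def reconnect_delays(base=1.0, fast_attempts=3, factor=2.0, cap=60.0):
    """Yield the wait in seconds before each reconnection attempt."""
    delay = base
    attempt = 0
    while True:
        yield delay
        attempt += 1
        # After the first few quick attempts, grow the delay up to a cap.
        if attempt >= fast_attempts:
            delay = min(delay * factor, cap)


class ResultBuffer:
    """Holds check results that could not be returned upstream yet."""

    def __init__(self, max_age=300.0, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock
        self._items = []  # list of (time stored, result)

    def add(self, result):
        self._items.append((self.clock(), result))

    def drain_fresh(self):
        """Return results still within max_age and drop everything else."""
        now = self.clock()
        fresh = [r for stored, r in self._items if now - stored <= self.max_age]
        self._items.clear()
        return fresh
```

On reconnect the worker would send whatever drain_fresh() returns and
restart the delay generator, so a later outage begins with quick
retries again.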

As for the core Nagios process, if the remote worker is disconnected, 
the core process sees the disconnect and unregisters the remote worker. 
In this case, some other worker may get the next check. If that other 
worker cannot reach the host to be checked or no worker can perform the 
check, you'd get a host unreachable or down state. If the other worker 
can reach the host to be checked, all is well except that you've lost some
of your distributed monitoring capability.
>
> Encryption is a must, of course, as the packets will have to contain
> passwords some of the time. There's a libssh2 available which we should
> be able to use to set up preshared key authentication with security
> that even NSA will approve of.
I like the idea of libssh2. SSH is simpler both in concept and 
implementation than a PKI.


-- 
Eric Stanley
___
Developer
Nagios Enterprises, LLC
Email:  estanley at nagios.com
Web:    www.nagios.com

