Nagios as a Service Resiliency Manager

Thomas Guyot-Sionnest dermoth at aei.ca
Mon Dec 14 05:37:21 CET 2009


On 10/12/09 12:08 PM, Christopher McAtackney wrote:
> Hi all,
> 
> I have a need to control an Active / Passive pair of components and
> was wondering if anyone had tackled this problem with Nagios?
> 
> The scenario is as follows:
> 
> Host A has SERVICE_1 installed and running. Host B has SERVICE_2
> installed, but not running.
> 
> The desired functionality is to detect when SERVICE_1 is not running
> (or that Host A is down / unreachable), and then to start SERVICE_2 on
> Host B.
> 
> I believe I can do this with Nagios by defining an event handler on
> SERVICE_1 which will make the appropriate call to start SERVICE_2 on
> Host B.
> 
> Would it make sense to store the relationship between SERVICE_1 and
> Host B / SERVICE_2 as a service macro, e.g.
> $_SERVICE_PASSIVE_HOSTNAME, $_SERVICE_PASSIVE_SERVICENAME?
> 
> There are too many scenarios in which SERVICE_1 might come back up
> to try to automate switching off SERVICE_2, I believe, e.g. if
> someone pulled a network cable on Host A accidentally, then plugged it
> in 15 minutes later - during which time Nagios detects that it is down
> and so starts up SERVICE_2. The user then plugs the network lead back
> in and now we have two Active instances running - which is what we
> specifically wanted to avoid. Even if Nagios detects that the primary
> component is up, it's still too late because any Active / Active
> overlap will cause problems for this particular application.
> 
> I can't think of any way to automate that side of things - but does
> the general concept of having Nagios start up a Passive partner make
> sense?

Short answer: not really.

You're talking about clustering here, and clustering has its very own
set of problems that Nagios was never meant to solve. You would be better
off spending your time on a real clustering solution like Linux-HA (I
used this one, but I know there is other OSS clustering software around...).

Once you have your cluster set up, it makes sense to monitor the
services *and* the cluster software using Nagios. For failover services
I find the easiest way is to use a shared IP (an IP that moves from one
server to the other along with the services; this is very easy to add
once the cluster is set up) so you always check the service where
it's supposed to be running. If a shared IP isn't an option, just monitor
the service on both servers and use check_cluster to aggregate the
results across all nodes.
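
To make the "aggregate across all nodes" idea concrete, here is a minimal
Python sketch of the kind of logic involved for an active/passive pair. It
is not check_cluster itself, only an illustration: the function name
active_passive_status and the host names are hypothetical, and the only
thing borrowed from Nagios is the standard plugin state numbering (0 = OK,
1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). It assumes a per-node check that
is OK only when the service is actually running on that node.

```python
# Standard Nagios plugin state IDs / exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def active_passive_status(node_states):
    """Aggregate per-node service states for an active/passive pair.

    node_states maps a node name to the state ID of the service check
    on that node (0 means the service is running there).
    Returns a (state_id, message) tuple.
    """
    if not node_states:
        return UNKNOWN, "UNKNOWN: no node states supplied"

    # Nodes on which the service check is currently OK (service running).
    active = sorted(name for name, state in node_states.items() if state == OK)

    if len(active) == 1:
        return OK, "OK: service active on %s" % active[0]
    if not active:
        return CRITICAL, "CRITICAL: service not running on any node"
    # More than one active instance is the overlap we want to hear about.
    return CRITICAL, "CRITICAL: service active on multiple nodes: %s" % ", ".join(active)


if __name__ == "__main__":
    # Hypothetical node names; Host B is the passive node here.
    print(active_passive_status({"hostA": OK, "hostB": CRITICAL}))
```

Note that a check like this only reports the state of the pair; it does
not start or stop anything, which stays the job of the cluster software.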

I'm not saying that you can't achieve this using Nagios... It might
actually work for very simple scenarios, but even then you
may end up accidentally running the service on both servers if you're
not very careful (something that clustering software will not let
happen). You have to take into account not only every possible failure
scenario but also every possible thing a human could be doing at the
same time your handlers try to recover the service! It's kind of like
reinventing the wheel, but not even using the right tools :)

-- 
Thomas
