Improving the host <parents> logic

Andreas Ericsson ae at op5.se
Wed Dec 14 23:30:29 CET 2005


Shane Stixrud wrote:
> Nagios's host parent logic is good but it could be a whole lot better 
> for today's switched networks.  There have been a couple of 
> recommendations in the past on how to improve this.
> 
> 1) Allow nagios admins to change parent logic failure detection in cases 
> where one parent is up but others are down.  By default nagios treats 
> multiple parents as redundant paths and thus does not suppress 
> notification in situations where at least one parent is OK.
> 
> The main disadvantage to this proposal is that nagios rightly treats 
> parents as directly connected hops on the path back to nagios.  This 
> workaround would treat switches and routers as peers when they are not, 
> removing the possibility of redundancy detection and of easily 
> determining which device is at fault.
> 
> 2) Allow the nagios admins to assign a weighted priority to each host 
> and have a system that allows the admin to tune these values to suppress 
> notification where appropriate.
> 
> This type of solution is, IMO, way more complex than is required; the 
> best part of the current solution is that it's simple to manage and 
> obvious to deploy.
> 
> The main problem with the existing solution is that modern switched 
> networks often have A LOT of managed nodes connected to one or more 
> layer2 switches in the same layer3 network.  Ideally nagios would allow 
> admins to suppress notification for devices behind both layer2 devices 
> and layer3 interfaces.


I sense some cheap-shot parent auto-detection junk using traceroute 
lurking here.

Layer 2 devices can be parents just as well as layer 3 devices, with any 
level of redundancy anywhere you want.


>  With that in mind I believe there is a relatively 
> easy solution that stays true to nagios's current parent model while 
> still meeting this challenge.
> 
> The existing parent logic should be able to remain pretty much as is, 
> merely renaming the directive to "l3parents" to make clear that it 
> should only be used for layer 3 parents.
> 

But it shouldn't. Each electron that makes up a part of a packet has to 
traverse a chain of physical nodes to reach its destination. Each of 
those nodes is a parent to whatever node it sends the electrons on to next.


> The existing parents logic would be duplicated under a new name, 
> l2parents.  Nagios would then need to be modified to first check
> l2parents before proceeding to the l3parents when a device goes into a 
> NON-OK state.  If all l2 parents or l3 parents are down nagios would 
> follow the l2 or l3 inherited parents just as it does today.
> 

Gaining what? Here's how we set it up. Scale it in your mind to any 
level and depth you like. I've never seen nagios do anything but The 
Right Thing with config like this.

nagios -> switch1 -> router -> switch2 -> host

define host {
	host_name switch1
}

define host {
	host_name router
	parents   switch1
}

define host {
	host_name switch2
	parents   router
}

define host {
	host_name host
	parents   switch2
}
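
The same scheme covers redundancy. What follows is only a minimal sketch, 
assuming two hypothetical access switches (switch2a and switch2b, which 
are not part of the setup above) that both hang off the router. As in the 
example above, the other directives a real host definition needs are left 
out. The parents directive takes a comma-separated list, and a host with 
several parents should only be treated as unreachable when every one of 
them is down or unreachable.

```
# Hypothetical redundant pair of layer 2 switches behind the router.
define host {
	host_name switch2a
	parents   router
}

define host {
	host_name switch2b
	parents   router
}

# This would take the place of the single-parent "host" definition above.
define host {
	host_name host
	parents   switch2a,switch2b
}
```

With this layout, losing switch2a alone should leave host reachable 
through switch2b, so a failed check on host points at host itself.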

> IMO this change would be the least intrusive, adds layer2 parent support 
> and allows for redundancy detection for both layer2 and layer3 devices 
> with little added complexity.
> 

If you're going to add all layer 2 devices in their own parenting link 
you need to know when layer 3 devices pop in between them, which means 
either keeping the current scheme alongside it or going looking for the 
object that has our current node as an immediate downstream child. 
Either way is just plain dumb.


> Side note: The 3d map should show the layer2 parents as being directly 
> connected to the child device.  The l3parents should only be connected 
> to devices where their layer2 and layer3 parents are the same NAME/IP.  
> In this way you would see a server connected to a switch that is in turn 
> connected to another switch which then connects to the layer3 device, 
> which, as it happens, is how the physical connectivity IS set up in 
> reality.
> 

Use my example from above and this is exactly what you get.



Usually when you post to a developer forum for the first time it's a 
good idea to browse the list archives for both the developer and the 
user forum. Had you done so, with just a few well-chosen searches, you 
would have spared yourself the indignity of letting this particular 
piece of digital pollution hit the internet.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

