Improving the host <parents> logic

Shane Stixrud shane at geeklords.org
Wed Dec 14 22:35:42 CET 2005


Nagios's host parent logic is good, but it could be a whole lot better for 
today's switched networks.  There have been a couple of recommendations in 
the past on how to improve it.

1) Allow nagios admins to change parent logic failure detection in cases 
where one parent is up but others are down.  By default nagios treats 
multiple parents as redundant paths and thus does not suppress 
notification in situations where at least one parent is OK.

The main disadvantage of this proposal is that nagios rightly treats 
parents as directly connected hops on the path back to nagios.  This 
workaround would treat switches and routers as peers when they are not, 
which removes redundancy detection and makes it harder to determine 
which device is at fault.

2) Allow the nagios admins to assign a weighted priority to each host and 
have a system that allows the admin to tune these values to suppress 
notification where appropriate.

IMO this type of solution is far more complex than required.  The best 
part of the current solution is that it is simple to manage and obvious 
to deploy.
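
For reference, the default behaviour described in 1) can be sketched 
roughly as follows.  This is only an illustration in Python, not nagios 
source code; the function name host_state and the dictionaries passed to 
it are made up for the example.

```python
# Hypothetical sketch of the current parent logic; not nagios source code.
def host_state(host, parents, is_up):
    """Classify a host from its own status and the status of its parents."""
    if is_up[host]:
        return "UP"
    host_parents = parents.get(host, [])
    # With no parents, or with at least one parent still up, the host
    # itself is blamed and notification is not suppressed.
    if not host_parents or any(is_up[p] for p in host_parents):
        return "DOWN"
    # Every parent is down, so the host is only unreachable.
    return "UNREACHABLE"


parents = {"server1": ["switch-a", "switch-b"]}
is_up = {"server1": False, "switch-a": False, "switch-b": True}
print(host_state("server1", parents, is_up))  # prints DOWN
```

Because switch-b is still up, the two switches are treated as redundant 
paths and server1 is reported as DOWN rather than UNREACHABLE.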

The main problem with the existing solution is that modern switched 
networks often have A LOT of managed nodes connected to one or more layer2 
switches in the same layer3 network.  Ideally nagios would allow admins to 
suppress notification for devices behind a failed layer2 device as well as 
devices behind a failed layer3 interface.  With that in mind I believe 
there is a relatively easy solution that stays true to nagios's current 
parent model while still meeting this challenge.

The existing parent logic should be able to remain pretty much as is, 
merely renaming the directive to "l3parents" to make clear that it 
should only be used for layer 3 parents.

The existing parents logic would then be duplicated under a new directive 
called "l2parents".  Nagios would need to be modified to check the 
l2parents first, before proceeding to the l3parents, when a device goes 
into a NON-OK state.  If all l2 parents or all l3 parents are down, nagios 
would follow the l2 or l3 inherited parents just as it does today.
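
A minimal sketch of that check order, assuming a host with an empty 
l2parents list simply falls through to its l3parents.  The names classify 
and all_down, the state labels and the sample hosts are all hypothetical; 
only the l2parents/l3parents directive names come from the proposal.

```python
# Hypothetical sketch of the proposed l2parents/l3parents check order.
def all_down(parent_list, is_up):
    """True only if there is at least one parent and none of them is up."""
    return bool(parent_list) and not any(is_up[p] for p in parent_list)


def classify(host, l2parents, l3parents, is_up):
    """Check layer 2 parents first, then layer 3 parents."""
    if is_up[host]:
        return "UP"
    if all_down(l2parents.get(host, []), is_up):
        return "UNREACHABLE (layer 2)"
    if all_down(l3parents.get(host, []), is_up):
        return "UNREACHABLE (layer 3)"
    # At least one parent is up at each layer, so blame the host itself.
    return "DOWN"


l2parents = {"server1": ["switch-a", "switch-b"]}
l3parents = {"server1": ["router1"]}
is_up = {"server1": False, "switch-a": False,
         "switch-b": False, "router1": True}
print(classify("server1", l2parents, l3parents, is_up))
# prints UNREACHABLE (layer 2)
```

With both switches down the server is reported as unreachable at layer 2 
even though the router is up.  If only one switch were down, the 
redundant layer 2 path would still count and the server would be DOWN.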

IMO this change would be the least intrusive: it adds layer2 parent 
support and allows redundancy detection for both layer2 and layer3 
devices with little added complexity.

Side note: The 3d map should show the layer2 parents as being 
directly connected to the child device.  The l3parents should only be 
connected to devices whose layer2 and layer3 parents are the same 
NAME/IP.  In this way you would see a server connected to a switch that is 
in turn connected to another switch which then connects to the layer3 
device, which is how the physical connectivity IS set up in reality.

Cheers,
Shane

