DNS down and false alerts...

Randal, Phil prandal at herefordshire.gov.uk
Tue Jun 9 17:42:34 CEST 2009


Option 5:  Install a local caching DNS server on your Nagios box, and
put 127.0.0.1 at the top of its resolv.conf.
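
A rough sketch of what that resolv.conf might look like, reusing the
addresses from the original post. It assumes a caching resolver is
already listening on 127.0.0.1 and forwarding to the two AD servers;
which resolver to run (dnsmasq, BIND, Unbound, ...) is left open here,
and any of them is an assumption rather than a recommendation.

	# /etc/resolv.conf on the Nagios box (sketch)
	# local cache first, AD servers kept as fallbacks
	nameserver 127.0.0.1
	nameserver 10.1.1.13
	nameserver 10.1.1.14
	search int.our.domain

The glibc resolver reads at most three nameserver lines, so this is the
most that will fit. "domain" and "search" are mutually exclusive (the
last one listed wins), so only "search" is kept.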
 
Cheers,
 
Phil
-- 
Phil Randal | Networks Engineer 
Herefordshire Council | Deputy Chief Executive's Office | I.C.T.
Services Division 
Thorn Office Centre, Rotherwas, Hereford, HR2 6JT 
Tel: 01432 260160 
email: prandal at herefordshire.gov.uk 


________________________________

From: Andrew Davis [mailto:nccomp at gmail.com] 
Sent: 09 June 2009 16:19
To: nagios-users at lists.sourceforge.net
Subject: [Nagios-users] DNS down and false alerts...


I've observed an interesting issue with Nagios. Our environment is a mix
of UNIX, Linux, Apple, and Windows. The core of the network is Active
Directory including two AD servers that are both our primary, internal
DNS servers. All non-Windows systems have a resolv.conf that looks like:


	nameserver 10.1.1.13
	nameserver 10.1.1.14
	domain int.our.domain
	search int.our.domain
	

About half of the servers have the nameserver entries inverted (i.e.
.14 first, .13 second).

The issue is that any time one of the nameservers is rebooted (at least
once a month if staying current on patches thanks to Black Tuesdays),
the hosts that list that nameserver first in their resolv.conf start
throwing the following error:


	CRITICAL - Plugin timed out while executing system call.
	

This occurs for multiple tests on each host. Obviously, there's a name
resolution correlation here. If the .13 nameserver is rebooted, all
hosts (about half of them) that list this IP first in their resolv.conf
time out on multiple tests. If the .14 server is rebooted, all the
other hosts time out. Interestingly, none of the Windows clients issue
errors... only UNIX, Linux, and Macs... only those with an
/etc/resolv.conf. The end result is a host of "false positives", but
more importantly it looks bad on availability reports and causes
phones/pagers to go ballistic with unneeded emails.

I'm trying to find a solution and I can't find one that I like:

Solution 1) is to cluster the DNS servers. We have lots of clusters
here. This isn't good, though, as you don't normally cluster DNS
servers... they're meant to be redundant for a reason... one fails and
it uses the next one.

Solution 2) is to set up a service/host dependency. My thought would be
either a host dependency that says if either .13 or .14 is down, then
don't alert for any other host that uses it. Or a service-to-host
dependency... if the DNS service is down, then don't alert on any of
these dependent hosts. Honestly, I'm not sure if you can mix host and
service dependencies like this... plus... if the DNS server is actually
down, then the DNS service is down, so better to use a host dependency.
The problem is that now we're not alerting on any dependent hosts which
themselves could have a legitimate issue we want to know about. Plus,
what happens if the DNS server actually dies and takes a few hours/days
to rebuild/restore? At that point, the dependent hosts aren't watched
for a very long time.

Solution 3) is to set up a UNIX/Linux DNS server that slaves all zones
from the AD servers and have all UNIX/Linux/Apple clients query this
server. This would work except that A) I need two of them to keep
redundancy and B) I've now added an extra layer of complication to
work around a problem with one application (Nagios)... not exactly good
practice.

Solution 4) is to set the timeout value of a host querying a DNS server.
Perhaps adjust the client to time out on the first listed nameserver
after only 10 seconds, then try the next one? Since most Nagios tests
have a minimum timeout value of 30 seconds, if the first DNS query timed
out after 10 seconds, it would go to the next one with, hopefully,
enough time to respond. The downside is having to adjust every single
server.
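
For reference, a minimal sketch of what Solution 4 could look like on a
host using the glibc resolver, which supports "timeout" and "attempts"
options in resolv.conf. The values below are illustrative assumptions,
not tested recommendations, and other platforms' resolvers may spell
these options differently or ignore them.

	# /etc/resolv.conf (sketch)
	nameserver 10.1.1.13
	nameserver 10.1.1.14
	search int.our.domain
	# wait 2 seconds per nameserver before trying the next one,
	# and go through the nameserver list at most 2 times
	options timeout:2 attempts:2

With these values an unreachable first nameserver should cost a lookup
about 2 seconds before the second server is tried. On glibc the default
is timeout:5 and attempts:2.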

Has anyone else seen this? Anyone else using Windows AD servers to
provide DNS for *nix servers? 

-- 


  A. Davis
  Email:     nccomp at gmail.com

  "There is no limit to what a man can accomplish
   if he doesn't care who gets the credit." - Ronald Reagan