DNS down and false alerts...

Andrew Davis nccomp at gmail.com
Tue Jun 9 19:26:01 CEST 2009


Hey... I'm the OP. We're using a mix of client tools. For Windows 
systems (which aren't affected by this) we use nsclient++. For our Linux 
servers, NRPE... for UNIX (Solaris) and OS X we're using check_by_ssh. 
Both the NRPE and check_by_ssh clients are affected by this.

I'm willing to give the caching nameserver on the Nagios server a try, 
but as others have noted, I don't think it will make a difference, as 
it's the local test on the client that's failing to resolve. I certainly 
can't set up a caching nameserver on every client...
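
To put rough numbers on the timing argument, here is a minimal Python
sketch. It is only a simplified model, not how any particular resolver
is implemented: it assumes each lookup waits one full resolver timeout
on the dead first nameserver before the second one answers. The
function names, the count of 3 lookups, the 0.5 second normal runtime
and the 10 second plugin timeout are made up for illustration; the 5
second resolver timeout is the documented glibc default, and Solaris or
OS X may differ.

```python
def dead_primary_delay(lookups, resolver_timeout=5.0):
    """Extra seconds spent waiting while the first nameserver is down.

    Simplified model (an assumption): every lookup waits one full
    resolver timeout on the dead server, then the next server answers
    immediately.
    """
    return lookups * resolver_timeout


def check_survives(lookups, plugin_timeout, resolver_timeout=5.0,
                   normal_runtime=0.5):
    """True if the check finishes before the plugin timeout fires."""
    total = normal_runtime + dead_primary_delay(lookups, resolver_timeout)
    return total < plugin_timeout


if __name__ == "__main__":
    # Hypothetical check: 3 name lookups, 10 second plugin timeout.
    for timeout in (5.0, 1.0):
        ok = check_survives(lookups=3, plugin_timeout=10.0,
                            resolver_timeout=timeout)
        verdict = "finishes in time" if ok else "plugin times out"
        print(f"resolver timeout {timeout:.0f}s: {verdict}")
```

Under those assumptions, shortening the per-server wait on the client
is what helps, not anything done on the Nagios server. On glibc systems
that wait can be lowered with a line such as "options timeout:1" in
resolv.conf; I have not checked the equivalent on Solaris or OS X.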

  A. Davis
  Email:     nccomp at gmail.com

  "There is no limit to what a man can accomplish
   if he doesn't care who gets the credit." - Ronald Reagan



Martin Melin wrote:
> I don't know if I'm misreading the OP, but if the plugins start timing 
> out on only the boxes whose primary DNS is being rebooted, would 
> adding a caching DNS server to the Nagios box really make a difference?
>
> I think the root cause of these timeouts is that the Nagios plugin 
> timeout fires before the lookup against the primary DNS on the 
> target machine has a chance to time out and fall back to the 
> secondary DNS.
>
> The correct course of action would be to either make sure that DNS 
> lookups on the target machines fail more quickly, or have 
> Nagios/the plugin wait longer for a result from the check. The 
> DNS failover is working as designed here, but you're not giving it 
> enough time to kick in.
>
> On Tue, Jun 9, 2009 at 5:37 PM, Russell Adams 
> <RLAdams at adamsinfoserv.com> wrote:
>
>     Really the best choice is to use caching DNS on the Nagios
>     server. I'd recommend dnsmasq; it just does caching locally without
>     needing to do big zone transfers. It has low overhead and simple
>     configuration as a result.
>
>     Enjoy.
>
>     On Tue, Jun 09, 2009 at 11:19:20AM -0400, Andrew Davis wrote:
>     > I've observed an interesting issue with Nagios. Our environment is
>     > a mix of UNIX, Linux, Apple, and Windows. The core of the network
>     > is Active Directory including two AD servers that are both our
>     > primary, internal DNS servers. All non-Windows systems have a
>     > resolv.conf that looks like:
>     >
>     >    nameserver 10.1.1.13
>     >    nameserver 10.1.1.14
>     >    domain int.our.domain
>     >    search int.our.domain
>     >
>     > About half of the servers have the nameserver entries inverted
>     > (ie: .14 first, .13 second).
>     >
>     > The issue is that any time one of the nameservers is rebooted (at
>     > least once a month if staying current on patches, thanks to Black
>     > Tuesdays), whichever hosts have that nameserver listed first in
>     > their resolv.conf start throwing the following error:
>     >
>     >    CRITICAL - Plugin timed out while executing system call.
>     >
>     > This occurs for multiple tests on each host. Obviously, there's a
>     > name resolution correlation here. If the nameserver at .13 is
>     > rebooted, all hosts (about half of them) that list this IP first
>     > in their resolv.conf then time out on multiple tests. If the .14
>     > server is rebooted, all the other hosts time out. Interestingly,
>     > none of the Windows clients issue errors... only UNIX, Linux, and
>     > Macs... only those with an /etc/resolv.conf. The end result is a
>     > host of "false positives", but more importantly it looks bad on
>     > availability reports and causes phones/pagers to go ballistic with
>     > unneeded emails.
>     >
>     > I'm trying to find a solution and I can't find one that I like:
>     >
>     > Solution 1) is to cluster the DNS servers. We have lots of clusters
>     > here. This isn't good, though, as you don't normally cluster DNS
>     > servers... they're meant to be redundant for a reason... one fails
>     > and it uses the next one.
>     >
>     > Solution 2) is to set up a service/host dependency. My thought
>     > would be either a host dependency that says if either .13 or .14
>     > is down, then don't alert for any other host that uses them, or a
>     > service-to-host dependency... if the DNS service is down, then
>     > don't alert on any of these dependent hosts. Honestly, I'm not sure
>     > if you can mix host and service dependencies like this... plus...
>     > if the DNS server is actually down, then the DNS service is down,
>     > so better to use a host dependency. The problem is that now we're
>     > not alerting on any dependent hosts, which themselves could have a
>     > legitimate issue we want to know about. Plus, what happens if the
>     > DNS server actually dies and takes a few hours/days to
>     > rebuild/restore? At that point, the dependent hosts aren't watched
>     > for a very long time.
>     >
>     > Solution 3) is to set up a UNIX/Linux DNS server that slaves all
>     > zones from the AD servers and have all UNIX/Linux/Apple clients
>     > query this server. This would work except that A) I need two of
>     > them to keep redundancy and B) I've now added an extra layer of
>     > complication to work around one application (Nagios)... not
>     > exactly good practice.
>     >
>     > Solution 4) is to set the timeout value of a host querying a DNS
>     > server. Perhaps adjust the client to time out on the first listed
>     > nameserver after only 10 seconds, then try the next one? Since most
>     > Nagios tests have a minimum timeout value of 30 seconds, if the
>     > first DNS query timed out after 10 seconds, it would go to the next
>     > one with, hopefully, enough time to respond. The downside is having
>     > to adjust every single server.
>     >
>     > Has anyone else seen this? Anyone else using Windows AD servers to
>     > provide DNS for *nix servers?
>     >
>
>
>     ------------------------------------------------------------------
>     Russell Adams                            RLAdams at AdamsInfoServ.com
>
>     PGP Key ID:     0x1160DCB3           http://www.adamsinfoserv.com/
>
>     Fingerprint:    1723 D8CA 4280 1EC9 557F  66E8 1154 E018 1160 DCB3
>
>
>
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null

