Dependency problem

Tedman Eng teng at dataway.com
Thu Apr 8 21:21:49 CEST 2004


The problem with a high host max_check_attempts is that ALL other
functions stop during this particular type of check.  This is by design,
since Nagios has to determine how widespread an outage is as it walks the
dependency tree using host checks.  The monitoring queue gets pushed back,
and every check's latency suffers.  With a lower max_check_attempts,
you'll see sooner (in the console) when the bad link is acting up, and
notification spam can be tuned down with escalations if needed.  In the end,
whatever works well for you is what you should do.  Our shop has had good
results with this setup.
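
To put rough numbers on that blocking, here is a back-of-the-envelope
sketch.  The 30 attempts and the three routers (RT1, RT2, RT3) come from
the quoted messages below; the 10 seconds per failed attempt is purely an
assumed figure, and the function name is made up for illustration.  It is
an upper bound under those assumptions, not a measurement of what Nagios
actually does.

```python
def worst_case_blocking_seconds(hosts_in_chain, max_check_attempts, seconds_per_attempt):
    """Upper bound on how long back-to-back host checks can hold up the queue."""
    return hosts_in_chain * max_check_attempts * seconds_per_attempt


# Assumption: every failed attempt takes 10 s to time out (hypothetical value).
SECONDS_PER_ATTEMPT = 10

# Three hosts in the chain (RT1, RT2, RT3), all unreachable.
high = worst_case_blocking_seconds(3, 30, SECONDS_PER_ATTEMPT)
low = worst_case_blocking_seconds(3, 3, SECONDS_PER_ATTEMPT)

print(f"max_check_attempts=30: up to {high} s of blocked monitoring")
print(f"max_check_attempts=3:  up to {low} s of blocked monitoring")
```

Whatever the real per-attempt time is, the bound scales linearly with
max_check_attempts, which is why lowering it helps everyone else's latency.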
 
Another avenue to look into is defining a more lenient check-host-alive
command.  Our company manages some networks throughout the world with very
poor links, where ping times often exceed 4000ms (4 seconds!!).  So we
cloned check-host-alive into a more lenient check-host-alive2, and hosts
get one or the other depending on their ISP line quality.
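
As a toy illustration of why the second command helps, the sketch below
compares a slow reply against two round-trip limits.  The limits (3000 ms
and 8000 ms) are hypothetical placeholders, not our real settings and not
Nagios defaults, and host_is_up is an invented helper rather than anything
Nagios provides.

```python
# Hypothetical round-trip-time limits in milliseconds, one per command.
# Neither value comes from this thread or from a stock Nagios install.
RTT_LIMIT_MS = {
    "check-host-alive": 3000,
    "check-host-alive2": 8000,
}


def host_is_up(command, rtt_ms):
    """Treat the host as up when a reply arrived within the command's limit.

    rtt_ms is None when no reply arrived at all.
    """
    return rtt_ms is not None and rtt_ms <= RTT_LIMIT_MS[command]


slow_link_rtt = 4000  # the 4-second ping time mentioned above
for command in RTT_LIMIT_MS:
    state = "UP" if host_is_up(command, slow_link_rtt) else "DOWN"
    print(command, "->", state)
```

With these placeholder limits the same 4-second reply counts as down under
the strict command and up under the lenient one.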

-----Original Message-----
From: Anastasios Zafeiropoulos [mailto:mls at freemail.gr]
Sent: Thursday, April 08, 2004 10:33 AM
To: Tedman Eng; nagios-users
Subject: Re: [Nagios-users] Dependency problem


 
 
 
Mr Tedman, thank you very much for your response throughout this kind of
flame.

I will disagree with you regarding max_check_attempts = 30. This is
tested and works as it should. When RT3 is unreachable or down, Nagios
starts pinging it with a countdown of 30 attempts. When that ends without
success, it moves on to the parent. And so on...

But I think you gave me a new kick start with the escalations thing. I'd
better go read the documentation a little more and see if this option
works for my case!

Thanks again

----- Original Message ----- 
From: Tedman Eng <teng at dataway.com>
To: 'Anastasios Zafeiropoulos' <mls at freemail.gr>; nagios-users
<nagios-users at lists.sourceforge.net>
Sent: Thursday, April 08, 2004 11:15 AM
Subject: RE: [Nagios-users] Dependency problem

Try lowering the host max_check_attempts.  When Nagios detects that a
service is bad, it host-checks each parent up the tree and will not do
ANYTHING else during the 30 check attempts you've set while it tries to
determine whether RT1, RT2, and/or RT3 is down.  This can adversely affect
your other monitored devices if those links are always flapping.  It's
better to monitor faster and notify slower than to slow down the entire
monitoring.  The host will show up in the console as up/down/flapping a
lot, which is its true state.  You can artificially slow down notifications
by using escalations.
 
For example:
set notification_interval to 5
set no contact for the normal notification (use the escalation instead)
set the escalation to notify starting at alert #2

Alert #1 then reaches nobody, and alert #2 goes out one notification
interval later, so in effect the device has to be down for a full 5 minutes
before you get notified.
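
The arithmetic behind that delay can be sketched as follows.  It assumes
alert #1 fires the moment the host is declared down and that each later
alert follows exactly one notification interval after the previous one;
the function name is made up for this example.

```python
def minutes_until_first_delivery(notification_interval, first_notification):
    """Minutes from alert #1 to the first alert that actually reaches a contact.

    notification_interval: minutes between consecutive alerts.
    first_notification: number of the first alert the escalation delivers.
    """
    if first_notification < 1:
        raise ValueError("alert numbers start at 1")
    return (first_notification - 1) * notification_interval


# Interval of 5, escalation starting at alert #2, as in the example above.
delay = minutes_until_first_delivery(notification_interval=5, first_notification=2)
print(delay)
```

Starting the escalation at alert #3 instead would stretch the same
calculation to 10 minutes, so the two settings together act as a dial for
how patient the notifications are.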
 
 
 
-----Original Message-----
From: Anastasios Zafeiropoulos [mailto:mls at freemail.gr]
Sent: Wednesday, April 07, 2004 12:59 PM
To: nagios-users
Subject: [Nagios-users] Dependency problem



Hello world,
 
I'm having trouble with either a host dependency misconfiguration or, why
not, a bug in Nagios' dependency logic and notification.
 
I am using version nagios-1.2-0.rhfc1.dag which was a prebuilt package from
Dag Apt repository site.
===================================================
My Topology:
===================================================
 
Nagios machine --- RT1 -- RT2 -- RT3 
 


====================================================
The problem
====================================================
 
When RT1 goes down, or the RT1-RT2 link goes down, Nagios will notice it at
a random point, while it is running a service check or the HOST_ALIVE check
against some part of the network that is down. Let's assume that the first
host Nagios finds dead is RT3.

Nagios doesn't get any reply from RT3, so RT3 is kept in a SOFT down state.

Next the RETRY process takes place. The max_check_attempts is 30 for each
host. That's because the links are not reliable at all, so we want to be a
little elastic with the notifications.

When we reach retry #30, Nagios assumes that RT3 IS DOWN, puts it in a HARD
DOWN state and looks for any dependencies associated with RT3. If you look
below, RT3 is dependent upon RT2. So it continues by trying to ping RT2.

While Nagios is trying to determine whether RT2 is alive or not, the
RT1-RT2 link suddenly comes up and the whole network is reachable by Nagios
again. Note that at this point the max_check_attempts for RT2 have not run
out. So Nagios gets a response from RT2 and puts it in a HARD OK state.
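
The sequence described above can be modelled with a small sketch.  It is a
simplified picture of the behaviour as I observe it, not Nagios' actual
code: the RT3-depends-on-RT2 relationship is modelled as a parent map, and
is_reachable is a hypothetical callback standing in for a full round of
host checks (all retries) against one host.

```python
# Parent of each host in the topology above.  RT1 sits next to the Nagios
# machine, so it has no monitored parent.
PARENT = {"RT3": "RT2", "RT2": "RT1", "RT1": None}


def walk_up_parents(start, is_reachable):
    """Return the hosts judged down, walking from start toward the Nagios machine.

    The walk stops at the first host that answers.
    """
    down = []
    host = start
    while host is not None and not is_reachable(host):
        down.append(host)
        host = PARENT[host]
    return down


# Scenario from this message: the RT1-RT2 link comes back after RT3 has been
# declared down but before the checks against RT2 are finished.
answers = {"RT3": False, "RT2": True, "RT1": True}
print(walk_up_parents("RT3", answers.get))
```

In this model RT3 ends up as the only host marked down, even though the
outage was on the RT1-RT2 link.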
 

