Dependency problem

Anastasios Zafeiropoulos mls at freemail.gr
Thu Apr 8 19:33:06 CEST 2004




Mr Tedman, thank you very much for your response in the middle of this kind of flame.

I disagree with you regarding max_check_attempts = 30. This is tested and works as it should. When RT3 is unreachable or down, Nagios starts pinging it with a countdown of 30 attempts. When the countdown ends without success, it moves on to the parent, and so on...

But I think you gave me a new kick start with the escalations idea. I'd better go read a little more of the documentation and see whether this option works for my case!

Thanks again
  ----- Original Message ----- 
  From: Tedman Eng 
  To: 'Anastasios Zafeiropoulos' ; nagios-users 
  Sent: Thursday, April 08, 2004 11:15 AM
  Subject: RE: [Nagios-users] Dependency problem


  Try lowering the host max_check_attempts.  When nagios detects a service is bad, it'll hostcheck each parent up the tree and will not do ANYTHING for the 30 check attempts you've set while it tries to determine whether RT1, RT2, and/or RT3 is down.  This can adversely affect your other monitored devices if those links are always flapping.  It's better to monitor faster and make notifications slower than to slow down the entire monitoring.  The host will show up in the console as up/down/flapping a lot, which is its true state.  You can artificially slow down notifications by using escalations.

  For example:
  set notification interval to 5
  set no contact for the normal notification (use the escalation instead)
  set the escalation to notify starting at alert #2

  This would in effect make it so the device would have to be down for a full 5 minutes before you get notified.
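
  A minimal sketch of the three steps above as object definitions, assuming the hostescalation object type with its first_notification, last_notification and notification_interval directives. The contact group name noc-admins and the max_check_attempts value of 3 are hypothetical placeholders, not values from this thread. How to leave the normal notification without a contact depends on the rest of the setup, so it is not shown.

  ```
  # Assumed values: max_check_attempts 3 is only an illustration of "lower".
  # notification_interval 5 makes the host re-notify, so alert #2 exists.
  define host{
   use                    generic-host
   host_name              RT3
   alias                  Wireless Internet
   address                212.34.23.4
   parents                RT2
   check_command          check-host-alive
   max_check_attempts     3
   notification_interval  5
   notification_period    24x7
   notification_options   d,u
   }

  # Escalation that takes over from alert #2 onward.
  # noc-admins is a hypothetical contact group; last_notification 0 means no upper limit.
  define hostescalation{
   host_name              RT3
   contact_groups         noc-admins
   first_notification     2
   last_notification      0
   notification_interval  5
   }
  ```

  If interval_length is left at its usual 60 seconds, alert #2 is generated about 5 minutes after alert #1, which is where the 5 minute delay comes from. Note that a host notification_interval of 0 would stop after the first alert, so the escalation would never be reached.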



   -----Original Message-----
  From: Anastasios Zafeiropoulos [mailto:mls at freemail.gr]
  Sent: Wednesday, April 07, 2004 12:59 PM
  To: nagios-users
  Subject: [Nagios-users] Dependency problem


    Hello world,

    I'm having trouble with either a host dependency misconfiguration or, why not, a bug in Nagios' dependency logic and notification process.

    I am using version nagios-1.2-0.rhfc1.dag, a prebuilt package from the Dag Apt repository.
    ===================================================
    My Topology:
    ===================================================

    Nagios machine --- RT1 -- RT2 -- RT3 


    ====================================================
    The problem
    ====================================================

    When RT1 goes down, or the RT1-RT2 link goes down, Nagios notices it at some random point, while it is checking a service or running the HOST_ALIVE function against any part of the network that is down. Let's assume that the first host Nagios found dead was RT3. Nagios didn't get any reply from RT3, so RT3 is kept in a SOFT DOWN state.

    Next, the RETRY process takes place. max_check_attempts is 30 for each host. That's because the links are not reliable at all, so we want to be a little elastic with the notifications.

    By the time we reach retry #30, Nagios assumes that RT3 IS DOWN, puts it in a HARD DOWN state, and looks for any dependencies associated with RT3. As you can see below, RT3 is dependent upon RT2, so Nagios continues by pinging RT2.

    While Nagios is trying to determine whether RT2 is alive or not, the RT1-RT2 link suddenly comes up and the whole network is reachable by Nagios again. Note that at this point the max_check_attempts for RT2 have not run out. So Nagios gets a response from RT2 and puts it in a HARD OK state.

    The result is that Nagios does NOT check RT3 again to see whether it is up, as RT2 is. So a notification is sent reporting that RT3 is down. This is FALSE. The whole network was down!

    My configuration is below. Maybe something is wrong with my conf files.

    Thanks in advance guys

    ====================================================
    My dependencies.cfg file
    ====================================================

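    # RT3 depends on RT2: hold RT3 notifications while RT2 is down or unreachable
    define hostdependency{
     host_name                      RT2
     dependent_host_name            RT3
     notification_failure_criteria  d,u
     }

    # RT2 depends on RT1: hold RT2 notifications while RT1 is down or unreachable
    define hostdependency{
     host_name                      RT1
     dependent_host_name            RT2
     notification_failure_criteria  d,u
     }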


    ===================================================
    My hosts.cfg
    ===================================================

    define host{
     use                    generic-host
     host_name              RT1
     alias                  Wireless 1
     address                213.5.0.34
     check_command          check-host-alive
     max_check_attempts     30
     notification_interval  0
     notification_period    24x7
     notification_options   d,u
     }


    define host{
     use                    generic-host
     host_name              RT2
     alias                  tsapi.twmn
     address                10.107.13.1
     parents                RT1
     check_command          check-host-alive
     max_check_attempts     30
     notification_interval  0
     notification_period    24x7
     notification_options   d,u
     }


    define host{
     use                    generic-host
     host_name              RT3
     alias                  Wireless Internet
     address                212.34.23.4
     parents                RT2
     check_command          check-host-alive
     max_check_attempts     30
     notification_interval  0
     notification_period    24x7
     notification_options   d,u
     }



    ____________________________________________________________________
    http://www.freemail.gr - δωρεάν υπηρεσία ηλεκτρονικού ταχυδρομείου.
    http://www.freemail.gr - free email service for the Greek-speaking.

