Nagios v3.5.0 transitioning immediately to a HARD state upon host problem

C. Bensend benny at bennyvision.com
Thu May 23 17:43:49 CEST 2013


Hey folks,

   I recently made two major changes to my Nagios environment:

1) I upgraded to v3.5.0.
2) I moved from a single server to two pollers sending passive
   results to one central console server.

   Now, this new distributed system was in place for several months
while I tested, and it worked fine.  HOWEVER, since this was running
in parallel with my production system, notifications were disabled.
Hence, I didn't see this problem until I cut over for real and
enabled notifications.

(please excuse any cut-n-paste ugliness - I had to send this info
from my work account via Outlook and then cleanse and reformat it
via Squirrelmail)

   As a test and to capture information, I rebooted 'hostname'.  This
log is from nagios-console, the host that accepts the passive check
results and sends notifications.  Here it is receiving a service
check failure while 'hostname' restarts:

May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk queue;CRITICAL;SOFT;1;Connection refused by host


So, the distributed poller system checks the host and sends its
results to the console server:

May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)


And then the centralized server IMMEDIATELY goes into a hard state,
which triggers a notification:

May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION: cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL - Host Unreachable (a.b.c.d)


   Um.  Wat?  Why would the console immediately declare a hard
state?  Nothing in the config files calls for that.  And this
IS a problem with the console server - the distributed pollers
correctly retry the host check up to 6 times like they should.
But for some reason, the centralized console immediately
calls it a hard state.
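
   For clarity, here's the progression I *expected* to see on the
console, given max_check_attempts of 6 (paraphrased by hand - these
are not real log lines):

hostname;DOWN;SOFT;1    <- first failure, no notification yet
hostname;DOWN;SOFT;2
...
hostname;DOWN;SOFT;5
hostname;DOWN;HARD;6    <- retries exhausted, NOW notify

Instead, it jumps straight from the first result to HARD;1.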

   Definitions on the distributed monitoring host (the one running
the actual host and service checks for 'hostname'):

define host {
     host_name                hostname
     alias                    Old production Nagios server
     address                  a.b.c.d
     action_url               /pnp4nagios/graph?host=$HOSTNAME$
     icon_image_alt           Red Hat Linux
     icon_image               redhat.png
     statusmap_image          redhat.gd2
     check_command            check-host-alive
     check_period             24x7
     notification_period      24x7
     contact_groups           linux-infrastructure-admins
     use                      linux-host-template
}

The linux-host-template on that same system:

define host {
     name                     linux-host-template
     register                 0
     max_check_attempts       6
     check_interval           5
     retry_interval           1
     notification_interval    360
     notification_options     d,r
     active_checks_enabled    1
     passive_checks_enabled   1
     notifications_enabled    1
     check_freshness          0
     check_period             24x7
     notification_period      24x7
     check_command            check-host-alive
     contact_groups           linux-infrastructure-admins
}
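
(With max_check_attempts 6 and retry_interval 1, and assuming the
default interval_length of 60 seconds, a DOWN host should get the
initial check plus 5 retries at 1-minute spacing - roughly 5 minutes
- before the state goes HARD and a notification fires.)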

And the check command used to determine up or down:

define command {
     command_name             check-host-alive
     command_line             $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 10000.0,100% -p 5
}
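
For reference, with the macros expanded, that should run as the
following for this host (the path assumes the stock
$USER1$=/usr/local/nagios/libexec from resource.cfg):

/usr/local/nagios/libexec/check_ping -H a.b.c.d -w 5000.0,80% -c 10000.0,100% -p 5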


Definitions on the centralized console host (the one that notifies):

define host {
      host_name                hostname
      alias                    Old production Nagios server
      address                  a.b.c.d
      action_url               /pnp4nagios/graph?host=$HOSTNAME$
      icon_image_alt           Red Hat Linux
      icon_image               redhat.png
      statusmap_image          redhat.gd2
      check_command            check-host-alive
      check_period             24x7
      notification_period      24x7
      contact_groups           linux-infrastructure-admins
      use                      linux-host-template,Default_monitor_server
}

The "Default monitor server" template on the centralized server:

define host {
      name                     Default_monitor_server
      register                 0
      active_checks_enabled    0
      passive_checks_enabled   1
      notifications_enabled    1
      check_freshness          0
      freshness_threshold      86400
}

And the linux-host-template on that same centralized host:

define host {
       name                    linux-host-template
       register                0
       max_check_attempts      6
       check_interval          5
       retry_interval          1
       notification_interval   360
       notification_options    d,r
       active_checks_enabled   1
       passive_checks_enabled  1
       notifications_enabled   1
       check_freshness         0
       check_period            24x7
       notification_period     24x7
       check_command           check-host-alive
       contact_groups          linux-infrastructure-admins
}


   This is causing some real problems:

1) If a single host polling cycle has a blip, it notifies
   IMMEDIATELY.
2) Because it notifies immediately, it effectively bypasses host
   dependencies.  So, when a WAN link goes down for example, it fires
   off notifications for *all* hosts at that site as fast as it can,
   when it should be retrying and then walking the dependency tree
   (see the sketch after this list).
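
   The dependencies themselves are nothing exotic - plain
hostdependency objects along these lines (host names here are made
up for illustration):

define hostdependency {
     host_name                       site-wan-router   ; hypothetical upstream router
     dependent_host_name             site-host-01      ; a host behind that link
     notification_failure_criteria   d,u               ; no notifications while the router is DOWN/UNREACHABLE
}

As I understand the docs, that suppression only looks at the master
host's last *hard* state, so if every host jumps straight to HARD on
its first passive result, the notifications race out before the
router's DOWN state can suppress them.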

   I do have translate_passive_host_checks=1 on the centralized
monitor, but the way I understand it, that shouldn't affect a
state going from SOFT to HARD.  Am I misinterpreting this?
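
   While digging through the docs, I also ran across the
passive_host_checks_are_soft directive for nagios.cfg.  If I'm
reading it right, it defaults to 0, which makes Nagios treat passive
host check results as HARD states - and that would explain exactly
what I'm seeing.  Here's a minimal sketch of what I think the fix
would look like on the console (this is NOT in my current config -
please correct me if I'm misreading the docs):

# nagios.cfg on the centralized console
accept_passive_host_checks=1
translate_passive_host_checks=1
# Docs say this defaults to 0, i.e. passive host results go
# straight to a HARD state.  Setting it to 1 should make them
# SOFT and honor max_check_attempts.
passive_host_checks_are_soft=1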

   Another variable: I'm using NConf for configuration management,
and it does some templating tricks to help with the distributed
monitoring setup.  But all it does is generate config files, and I
don't see anything in the generated configs that would explain
this behavior.

Any help would be greatly appreciated!

Benny


-- 
"The very existence of flamethrowers proves that sometime, somewhere,
someone said to themselves, 'You know, I want to set those people
over there on fire, but I'm just not close enough to get the job
done.'"                          -- George Carlin




