Nagios v3.5.0 transitioning immediately to a HARD state upon host problem

Doug Eubanks admin at dougware.net
Thu May 23 21:04:52 CEST 2013


I ran into a similar problem because my template set the service to
"is_volatile=1".

http://nagios.sourceforge.net/docs/3_0/volatileservices.html

Check to see if you have this flag enabled.
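
If it helps, here is a rough Python sketch (mine, not part of Nagios) for
finding object definitions that set is_volatile to a non-zero value. It
assumes you feed it the text of your object config files; find_volatile,
SAMPLE and the template names are made up for illustration. It only looks
at directives set directly inside each definition, so it does not follow
"use" inheritance for you.

```python
import re

# Matches "define <type> {" and "directive   value" lines in object configs.
DEFINE_RE = re.compile(r"^\s*define\s+(\w+)\s*\{")
DIRECTIVE_RE = re.compile(r"^\s*(\w+)\s+(.*?)\s*$")


def find_volatile(config_text):
    """Return (object_type, name) for each definition with is_volatile != 0."""
    hits = []
    obj_type = None   # type of the definition currently being read
    fields = {}       # directives collected for that definition
    for raw in config_text.splitlines():
        line = raw.split(";", 1)[0]          # drop trailing ';' comments
        if line.lstrip().startswith("#"):    # skip '#' comment lines
            continue
        start = DEFINE_RE.match(line)
        if start:
            obj_type, fields = start.group(1), {}
        elif obj_type and line.strip() == "}":
            if fields.get("is_volatile", "0") != "0":
                name = (fields.get("name")
                        or fields.get("service_description")
                        or fields.get("host_name")
                        or "?")
                hits.append((obj_type, name))
            obj_type = None
        elif obj_type:
            m = DIRECTIVE_RE.match(line)
            if m:
                fields[m.group(1)] = m.group(2)
    return hits


# Hypothetical templates, only here to show the input format.
SAMPLE = """
define service {
    name            volatile-template   ; hypothetical template name
    register        0
    is_volatile     1
}
define service {
    name            normal-template
    register        0
    is_volatile     0
}
"""

if __name__ == "__main__":
    for obj_type, name in find_volatile(SAMPLE):
        print(obj_type, name)
```

Anything it reports is a definition worth a look; objects that pull in a
reported template via "use" pick up the flag as well, unless they override
it.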

Doug

Sincerely,
Doug Eubanks
admin at dougware.net
K1DUG
(919) 201-8750


On Thu, May 23, 2013 at 11:43 AM, C. Bensend <benny at bennyvision.com> wrote:

>
> Hey folks,
>
>    I recently made two major changes to my Nagios environment:
>
> 1) I upgraded to v3.5.0.
> 2) I moved from a single server to two pollers sending passive
>    results to one central console server.
>
>    Now, this new distributed system was in place for several months
> while I tested, and it worked fine.  HOWEVER, since this was running
> in parallel with my production system, notifications were disabled.
> Hence, I didn't see this problem until I cut over for real and
> enabled notifications.
>
> (please excuse any cut-n-paste ugliness, had to send this info from
> my work account via Outlook and then try to cleanse and reformat
> via Squirrelmail)
>
>    As a test and to capture information, I rebooted 'hostname'.  This
> log is from the nagios-console host, which is the host that accepts
> the passive check results and sends notifications.  Here is the
> console host receiving a service check failure when the host is
> restarting:
>
> May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk queue;CRITICAL;SOFT;1;Connection refused by host
>
>
> So, the distributed poller system checks the host and sends its
> results to the console server:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)
>
>
> And then the centralized server IMMEDIATELY goes into a hard state,
> which triggers a notification:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT: hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
> May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION: cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL - Host Unreachable (a.b.c.d)
>
>
>    Um.  Wat?  Why would the console immediately trigger a hard
> state? The config files don't support this decision.  And this
> IS a problem with the console server - the distributed monitors
> continue checking the host 6 times, like they should.  But
> for some reason, the centralized console just immediately
> calls it a hard state.
>
>    Definitions on the distributed monitoring host (the one running
> the actual host and service checks for this host 'hostname'):
>
> define host {
>      host_name                hostname
>      alias                    Old production Nagios server
>      address                  a.b.c.d
>      action_url               /pnp4nagios/graph?host=$HOSTNAME$
>      icon_image_alt           Red Hat Linux
>      icon_image               redhat.png
>      statusmap_image          redhat.gd2
>      check_command            check-host-alive
>      check_period             24x7
>      notification_period      24x7
>      contact_groups           linux-infrastructure-admins
>      use                      linux-host-template
> }
>
> The linux-host-template on that same system:
>
> define host {
>      name                     linux-host-template
>      register                 0
>      max_check_attempts       6
>      check_interval           5
>      retry_interval           1
>      notification_interval    360
>      notification_options     d,r
>      active_checks_enabled    1
>      passive_checks_enabled   1
>      notifications_enabled    1
>      check_freshness          0
>      check_period             24x7
>      notification_period      24x7
>      check_command            check-host-alive
>      contact_groups           linux-infrastructure-admins
> }
>
> And said command to determine up or down:
>
> define command {
>      command_name             check-host-alive
>      command_line             $USER1$/check_ping -H $HOSTADDRESS$ -w 5000.0,80% -c 10000.0,100% -p 5
> }
>
>
> Definitions on the centralized console host (the one that notifies):
>
> define host {
>       host_name                hostname
>       alias                    Old production Nagios server
>       address                  a.b.c.d
>       action_url               /pnp4nagios/graph?host=$HOSTNAME$
>       icon_image_alt           Red Hat Linux
>       icon_image               redhat.png
>       statusmap_image          redhat.gd2
>       check_command            check-host-alive
>       check_period             24x7
>       notification_period      24x7
>       contact_groups           linux-infrastructure-admins
>       use                      linux-host-template,Default_monitor_server
> }
>
> The "Default_monitor_server" template on the centralized server:
>
> define host {
>       name                     Default_monitor_server
>       register                 0
>       active_checks_enabled    0
>       passive_checks_enabled   1
>       notifications_enabled    1
>       check_freshness          0
>       freshness_threshold      86400
> }
>
> And the linux-host-template template on that same centralized host:
>
> define host {
>        name                    linux-host-template
>        register                0
>        max_check_attempts      6
>        check_interval          5
>        retry_interval          1
>        notification_interval   360
>        notification_options    d,r
>        active_checks_enabled   1
>        passive_checks_enabled  1
>        notifications_enabled   1
>        check_freshness         0
>        check_period            24x7
>        notification_period     24x7
>        check_command           check-host-alive
>        contact_groups          linux-infrastructure-admins
> }
>
>
>    This is causing some real problems:
>
> 1) If a single host polling cycle has a blip, it notifies
>    IMMEDIATELY.
> 2) Because it notifies immediately, it ignores host dependencies.
>    So, when a WAN link goes down for example, it fires off
>    notifications for *all* hosts at that site as fast as it can,
>    when it should be retrying, and then walking the dependency tree.
>
>    I do have translate_passive_host_checks=1 on the centralized
> monitor, but the way I understand it, that shouldn't affect a
> state going from SOFT to HARD.  Am I misinterpreting this?
>
>    Another variable - I'm using NConf for the configuration management,
> and it does some templating tricks to help with the distributed
> monitoring setup.  But, all it does is generate config files, and I
> don't see any evidence in the configs as to why this would be
> happening.
>
> Any help would be greatly appreciated!
>
> Benny
>
>
> --
> "The very existence of flamethrowers proves that sometime, somewhere,
> someone said to themselves, 'You know, I want to set those people
> over there on fire, but I'm just not close enough to get the job
> done.'"                          -- George Carlin
>
>
>
>
>
>
> ------------------------------------------------------------------------------
> Try New Relic Now & We'll Send You this Cool Shirt
> New Relic is the only SaaS-based application performance monitoring service
> that delivers powerful full stack analytics. Optimize and monitor your
> browser, app, & servers with just a few lines of code. Try New Relic
> and get this awesome Nerd Life shirt! http://p.sf.net/sfu/newrelic_d2d_may
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
>