Nagios v3.5.0 transitioning immediately to a HARD state upon host problem

Andreas Ericsson ae at op5.se
Fri May 24 09:42:37 CEST 2013


On 2013-05-23 17:43, C. Bensend wrote:
>
> Hey folks,
>
>     I recently made two major changes to my Nagios environment:
>
> 1) I upgraded to v3.5.0.
> 2) I moved from a single server to two pollers sending passive
>     results to one central console server.
>
>     Now, this new distributed system was in place for several months
> while I tested, and it worked fine.  HOWEVER, since this was running
> in parallel with my production system, notifications were disabled.
> Hence, I didn't see this problem until I cut over for real and
> enabled notifications.
>
> (please excuse any cut-n-paste ugliness, had to send this info from
> my work account via Outlook and then try to cleanse and reformat
> via Squirrelmail)
>
>     As a test and to capture information, I rebooted 'hostname'.  This
> log is from the nagios-console host, which is the host that accepts
> the passive check results and sends notifications.  Here is the
> console host receiving a service check failure when the host is
> restarting:
>
> May 22 15:57:10 nagios-console nagios: SERVICE ALERT: hostname;/var disk
> queue;CRITICAL;SOFT;1;Connection refused by host
>
>
> So, the distributed poller system checks the host and sends its
> results to the console server:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT:
> hostname;DOWN;SOFT;1;CRITICAL - Host Unreachable (a.b.c.d)
>
>
> And then the centralized server IMMEDIATELY goes into a hard state,
> which triggers a notification:
>
> May 22 15:57:30 nagios-console nagios: HOST ALERT:
> hostname;DOWN;HARD;1;CRITICAL - Host Unreachable (a.b.c.d)
> May 22 15:57:30 nagios-console nagios: HOST NOTIFICATION:
> cbensend;hostname;DOWN;host-notify-by-email-test;CRITICAL -
> Host Unreachable (a.b.c.d)
>
>
>     Um.  Wat?  Why would the console immediately trigger a hard
> state?  Nothing in the config files calls for that behavior.  And this
> IS a problem with the console server - the distributed monitors
> keep checking the host 6 times like they should.  But for some
> reason, the centralized console just immediately calls it a hard
> state.
>
>     Definitions on the distributed monitoring host (the one running
> the actual host and service checks for the host 'hostname'):
>
> define host {
>       host_name                hostname
>       alias                    Old production Nagios server
>       address                  a.b.c.d
>       action_url               /pnp4nagios/graph?host=$HOSTNAME$
>       icon_image_alt           Red Hat Linux
>       icon_image               redhat.png
>       statusmap_image          redhat.gd2
>       check_command            check-host-alive
>       check_period             24x7
>       notification_period      24x7
>       contact_groups           linux-infrastructure-admins
>       use                      linux-host-template
> }
>
> The linux-host-template on that same system:
>
> define host {
>       name                     linux-host-template
>       register                 0
>       max_check_attempts       6
>       check_interval           5
>       retry_interval           1
>       notification_interval    360
>       notification_options     d,r
>       active_checks_enabled    1
>       passive_checks_enabled   1
>       notifications_enabled    1
>       check_freshness          0
>       check_period             24x7
>       notification_period      24x7
>       check_command            check-host-alive
>       contact_groups           linux-infrastructure-admins
> }
>
> And said command to determine up or down:
>
> define command {
>       command_name             check-host-alive
>       command_line             $USER1$/check_ping -H $HOSTADDRESS$ -w
> 5000.0,80% -c 10000.0,100% -p 5
> }
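>
> (For reference, with the macros expanded - and assuming the stock
> $USER1$ path of /usr/local/nagios/libexec - that works out to
> roughly:
>
> /usr/local/nagios/libexec/check_ping -H a.b.c.d -w 5000.0,80% -c 10000.0,100% -p 5
>
> i.e. 5 pings, WARNING at 5000ms RTA or 80% loss, CRITICAL at
> 10000ms RTA or 100% loss.)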
>
>
> Definitions on the centralized console host (the one that notifies):
>
> define host {
>        host_name                hostname
>        alias                    Old production Nagios server
>        address                  a.b.c.d
>        action_url               /pnp4nagios/graph?host=$HOSTNAME$
>        icon_image_alt           Red Hat Linux
>        icon_image               redhat.png
>        statusmap_image          redhat.gd2
>        check_command            check-host-alive
>        check_period             24x7
>        notification_period      24x7
>        contact_groups           linux-infrastructure-admins
>        use                      linux-host-template,Default_monitor_server
> }
>
> The "Default monitor server" template on the centralized server:
>
> define host {
>        name                     Default_monitor_server
>        register                 0
>        active_checks_enabled    0
>        passive_checks_enabled   1
>        notifications_enabled    1
>        check_freshness          0
>        freshness_threshold      86400
> }
>
> And the linux-host-template template on that same centralized host:
>
> define host {
>         name                    linux-host-template
>         register                0
>         max_check_attempts      6
>         check_interval          5
>         retry_interval          1
>         notification_interval   360
>         notification_options    d,r
>         active_checks_enabled   1
>         passive_checks_enabled  1
>         notifications_enabled   1
>         check_freshness         0
>         check_period            24x7
>         notification_period     24x7
>         check_command           check-host-alive
>         contact_groups          linux-infrastructure-admins
> }
>
>
>     This is causing some real problems:
>
> 1) If a single host polling cycle has a blip, it notifies
>     IMMEDIATELY.
> 2) Because it notifies immediately, it ignores host dependencies.
>     So when a WAN link goes down, for example, it fires off
>     notifications for *all* hosts at that site as fast as it can,
>     when it should be retrying and then walking the dependency tree.
>
>     I do have translate_passive_host_checks=1 on the centralized
> monitor, but the way I understand it, that shouldn't affect a
> state going from SOFT to HARD.  Am I misinterpreting this?
>
>     Another variable - I'm using NConf for configuration management,
> and it does some templating tricks to help with the distributed
> monitoring setup.  But, all it does is generate config files, and I
> don't see any evidence in the configs as to why this would be
> happening.
>
> Any help would be greatly appreciated!
>

Set passive_host_checks_are_soft=1 in nagios.cfg on your master
server and things should start working as intended.
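
For reference, the relevant bits of nagios.cfg on the central server
would look something like this (everything else left at your current
values):

# Treat passive host check results as SOFT states, so that
# max_check_attempts is honored before the host goes HARD:
passive_host_checks_are_soft=1

# You already have this; it only remaps DOWN vs UNREACHABLE
# according to the central server's view of the network:
translate_passive_host_checks=1

The default for passive_host_checks_are_soft is 0, which makes Nagios
treat every passive host check result as a HARD state change - that's
exactly the instant HARD state and notification you're seeing, and it
has nothing to do with translate_passive_host_checks.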

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
