passive on-demand host checks being converted from soft to hard

Frost, Mark {PBG} mark.frost1 at pepsi.com
Mon Jul 7 22:38:22 CEST 2008


I've been seeing a problem with on-demand host checking since we moved
to a distributed setup.  We're running Nagios 3.0.2 with a central
server that does virtually no checks.  All checks are performed by 2
other distributed servers.

I have an example here in which the distributed node detects a service
failure and then a host failure.  On the distributed node, I see:

	Host Down[07-07-2008 15:30:44] HOST ALERT: mfrost_win;DOWN;HARD;10;FPING CRITICAL - PB9700DL1JDGHD1.corp.pep.pvt (loss=100% )
	Host Down[07-07-2008 15:29:42] HOST ALERT: mfrost_win;DOWN;SOFT;9;FPING CRITICAL - mfrost_win (loss=100% )
	Host Down[07-07-2008 15:28:40] HOST ALERT: mfrost_win;DOWN;SOFT;8;FPING CRITICAL - mfrost_win (loss=100% )
	Host Down[07-07-2008 15:27:38] HOST ALERT: mfrost_win;DOWN;SOFT;7;FPING CRITICAL - mfrost_win (loss=100% )
	Host Down[07-07-2008 15:26:36] HOST ALERT: mfrost_win;DOWN;SOFT;6;FPING CRITICAL - mfrost_win (loss=100% )
	Host Down[07-07-2008 15:25:34] HOST ALERT: mfrost_win;DOWN;SOFT;5;FPING CRITICAL - mfrost_win (loss=100% )
	Host Down[07-07-2008 15:24:32] HOST ALERT: mfrost_win;DOWN;SOFT;4;FPING CRITICAL - mfrost_win (loss=100% )
	Host Down[07-07-2008 15:23:30] HOST ALERT: mfrost_win;DOWN;SOFT;3;FPING CRITICAL - mfrost_win (loss=100% )
	Host Down[07-07-2008 15:22:28] HOST ALERT: mfrost_win;DOWN;SOFT;2;FPING CRITICAL - mfrost_win (loss=100% )
	Service Critical[07-07-2008 15:22:24] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
	Host Down[07-07-2008 15:21:26] HOST ALERT: mfrost_win;DOWN;SOFT;1;FPING CRITICAL - mfrost_win (loss=100% )
	Service Critical[07-07-2008 15:21:24] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.

But for the corresponding set of activities I see the following on the
central/reporting server:

	Service Critical[07-07-2008 15:22:29] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
	Host Down[07-07-2008 15:21:33] HOST ALERT: mfrost_win;DOWN;HARD;1;FPING CRITICAL - mfrost_win (loss=100% )
	Service Critical[07-07-2008 15:21:33] SERVICE ALERT: mfrost_win;C: Drive Space;CRITICAL;SOFT;1;CHECK_NRPE: Socket timeout after 10 seconds.
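
To put the two logs side by side, here is a rough Python sketch that
pulls the HOST ALERT entries out of each excerpt and reports the first
HARD state on each server.  It assumes the wrapped lines have been
rejoined and that the timestamps are in MM-DD-YYYY order, and it only
carries a few of the entries above as sample data.  The function names
are my own, not anything from Nagios.

```python
import re
from datetime import datetime

# Pattern for HOST ALERT lines in the form shown in this message
# (an assumption; raw nagios.log entries use epoch timestamps instead).
ALERT_RE = re.compile(
    r"\[(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] HOST ALERT: "
    r"(?P<host>[^;]+);(?P<state>[^;]+);(?P<type>SOFT|HARD);(?P<attempt>\d+);"
)

def parse_host_alerts(text):
    """Return (time, host, state, state_type, attempt) tuples, oldest first."""
    alerts = []
    for m in ALERT_RE.finditer(text):
        when = datetime.strptime(m.group("ts"), "%m-%d-%Y %H:%M:%S")
        alerts.append((when, m.group("host"), m.group("state"),
                       m.group("type"), int(m.group("attempt"))))
    return sorted(alerts)

def first_hard(alerts):
    """Return the earliest alert whose state type is HARD, or None."""
    return next((a for a in alerts if a[3] == "HARD"), None)

# Sample data: a subset of the entries quoted above, one entry per line.
DISTRIBUTED = """\
Host Down[07-07-2008 15:30:44] HOST ALERT: mfrost_win;DOWN;HARD;10;FPING CRITICAL - PB9700DL1JDGHD1.corp.pep.pvt (loss=100% )
Host Down[07-07-2008 15:29:42] HOST ALERT: mfrost_win;DOWN;SOFT;9;FPING CRITICAL - mfrost_win (loss=100% )
Host Down[07-07-2008 15:21:26] HOST ALERT: mfrost_win;DOWN;SOFT;1;FPING CRITICAL - mfrost_win (loss=100% )
"""
CENTRAL = """\
Host Down[07-07-2008 15:21:33] HOST ALERT: mfrost_win;DOWN;HARD;1;FPING CRITICAL - mfrost_win (loss=100% )
"""

central_hard = first_hard(parse_host_alerts(CENTRAL))
node_hard = first_hard(parse_host_alerts(DISTRIBUTED))
gap = (node_hard[0] - central_hard[0]).total_seconds()
print(f"central HARD at attempt {central_hard[4]}, "
      f"node HARD at attempt {node_hard[4]}, {gap:.0f}s later")
```

On the sample data this shows the central server going HARD at attempt
1 while the distributed node only gets there at attempt 10.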

The distributed node seems to do what it's supposed to do and keeps
retrying up to max_check_attempts (10).  When that first (soft) ping
failure gets passed to the central/reporting server, the central server
records it as a hard DOWN state and sends an alert out immediately.
Meanwhile the distributed node continues checking for roughly nine more
minutes before it determines that the host is in a hard DOWN state.
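
To make the behaviour I expected concrete, here is a simplified Python
model of how I understand soft/hard host states: each consecutive failed
check bumps the attempt counter, and the state stays SOFT until the
counter reaches max_check_attempts.  This is only my reading of the
documentation, not Nagios's actual code, and the function name is made
up for illustration.

```python
def down_state_types(consecutive_failures, max_check_attempts):
    """Return (attempt, state_type) for each consecutive failed host check.

    Simplified model, not Nagios source: a DOWN result is SOFT until the
    attempt number reaches max_check_attempts, then HARD.
    """
    history = []
    for n in range(1, consecutive_failures + 1):
        attempt = min(n, max_check_attempts)
        state_type = "HARD" if attempt == max_check_attempts else "SOFT"
        history.append((attempt, state_type))
    return history

# With max_check_attempts = 10, ten failures give nine SOFT states and
# then one HARD state, which is what the distributed node logged.
for attempt, state_type in down_state_types(10, 10):
    print(attempt, state_type)
```

In this model, HARD at attempt 1 only comes out when max_check_attempts
is 1, yet the central server logged exactly that with max_check_attempts
set to 10.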

The settings for this host are as follows:

central server:
	max_check_attempts     10
	check_interval         0
	retry_interval         1
	obsess_over_host       0
	active_checks_enabled  0
	passive_checks_enabled 1
	check_freshness        1
	freshness_threshold    1200

distributed node:
	max_check_attempts     10
	check_interval         0
	retry_interval         1
	obsess_over_host       1
	active_checks_enabled  1
	passive_checks_enabled 0
	check_freshness        0
	freshness_threshold    1200
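
For quick reference, a small Python sketch that lists only the settings
that differ between the two definitions above.  The dictionaries are
copied by hand from the values listed here; the helper name is my own.

```python
# Host settings as listed above, copied by hand.
CENTRAL = {
    "max_check_attempts": 10,
    "check_interval": 0,
    "retry_interval": 1,
    "obsess_over_host": 0,
    "active_checks_enabled": 0,
    "passive_checks_enabled": 1,
    "check_freshness": 1,
    "freshness_threshold": 1200,
}
DISTRIBUTED = {
    "max_check_attempts": 10,
    "check_interval": 0,
    "retry_interval": 1,
    "obsess_over_host": 1,
    "active_checks_enabled": 1,
    "passive_checks_enabled": 0,
    "check_freshness": 0,
    "freshness_threshold": 1200,
}

def differing_settings(a, b):
    """Return {name: (a_value, b_value)} for settings present in both but unequal."""
    return {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}

diff = differing_settings(CENTRAL, DISTRIBUTED)
for name, (central, node) in diff.items():
    print(f"{name}: central={central} distributed={node}")
```

The retry-related settings (max_check_attempts, check_interval,
retry_interval) are identical on both sides; only the active/passive,
obsess and freshness flags differ.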


Everything else works fine monitoring-wise, but this problem has been
bugging me for months now.  I'm at that crossroads where I'm trying to
determine if this is a bug or if I'm doing something wrong that I can't
figure out.  As far as I can glean from the documentation, this isn't
how this is supposed to work given the way I've configured things.

Thanks

Mark

