Restarts resetting soft critical states

Frost, Mark {PBG} mark.frost1 at pepsi.com
Thu Oct 29 20:50:22 CET 2009


You think you know an application and every once in a while you get a surprise...

I'm running Nagios 3.0.6 in a distributed configuration.  We had a host that was unpingable starting at 14:45.  It was configured to retry the ping check until it had failed 10 times, then send us an alert.  While this was going on, I was making changes to the configuration (for other hosts/services) and restarting Nagios so the changes would take effect.  These restarts would have occurred on both the distributed node and the reporting server.

From looking at the history of this host, it appears that the soft criticals were logged, but each time the server was restarted, it reset the counter:

Host Down[10-29-2009 15:14:56] HOST ALERT: psplunk2;DOWN;HARD;10;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 15:13:26] HOST ALERT: psplunk2;DOWN;SOFT;9;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 15:12:12] HOST ALERT: psplunk2;DOWN;SOFT;8;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 15:10:44] HOST ALERT: psplunk2;DOWN;SOFT;7;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 15:09:30] HOST ALERT: psplunk2;DOWN;SOFT;6;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 15:08:24] HOST ALERT: psplunk2;DOWN;SOFT;5;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 15:07:08] HOST ALERT: psplunk2;DOWN;SOFT;4;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 15:06:04] HOST ALERT: psplunk2;DOWN;SOFT;3;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 15:04:56] HOST ALERT: psplunk2;DOWN;SOFT;2;FPING CRITICAL - psplunk2. (loss=100% )
Program Start[10-29-2009 15:03:42] Nagios 3.0.6 starting... (PID=932)
Program End[10-29-2009 15:03:41] Caught SIGTERM, shutting down...
Host Down[10-29-2009 15:02:33] HOST ALERT: psplunk2;DOWN;SOFT;7;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 15:01:23] HOST ALERT: psplunk2;DOWN;SOFT;6;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 15:00:05] HOST ALERT: psplunk2;DOWN;SOFT;5;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 14:59:05] HOST ALERT: psplunk2;DOWN;SOFT;4;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 14:57:57] HOST ALERT: psplunk2;DOWN;SOFT;3;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 14:56:55] HOST ALERT: psplunk2;DOWN;SOFT;2;FPING CRITICAL - psplunk2. (loss=100% )
Program Start[10-29-2009 14:56:19] Nagios 3.0.6 starting... (PID=31184)
Program End[10-29-2009 14:56:17] Caught SIGTERM, shutting down...
Host Down[10-29-2009 14:55:47] HOST ALERT: psplunk2;DOWN;SOFT;6;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 14:54:37] HOST ALERT: psplunk2;DOWN;SOFT;5;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 14:53:27] HOST ALERT: psplunk2;DOWN;SOFT;4;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 14:52:17] HOST ALERT: psplunk2;DOWN;SOFT;3;FPING CRITICAL - psplunk2. (loss=100% )
Host Down[10-29-2009 14:51:17] HOST ALERT: psplunk2;DOWN;SOFT;2;FPING CRITICAL - psplunk2. (loss=100% )
Program Start[10-29-2009 14:50:49] Nagios 3.0.6 starting... (PID=30236)
Program End[10-29-2009 14:50:48] Caught SIGTERM, shutting down...
Host Down[10-29-2009 14:50:07] HOST ALERT: psplunk2;DOWN;SOFT;4;FPING CRITICAL - psplunk2. (loss=100% )
Service Critical[10-29-2009 14:49:51] SERVICE ALERT: psplunk2;Splunk Daemon;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down[10-29-2009 14:48:59] HOST ALERT: psplunk2;DOWN;SOFT;3;FPING CRITICAL - psplunk2. (loss=100% )
Service Critical[10-29-2009 14:48:43] SERVICE ALERT: psplunk2;Time;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 15 seconds.
Service Critical[10-29-2009 14:48:15] SERVICE ALERT: psplunk2;Load Average;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down[10-29-2009 14:47:57] HOST ALERT: psplunk2;DOWN;SOFT;2;FPING CRITICAL - psplunk2. (loss=100% )
Service Critical[10-29-2009 14:47:41] SERVICE ALERT: psplunk2;Network Time Daemon;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Critical[10-29-2009 14:47:31] SERVICE ALERT: psplunk2;Free Memory;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 20 seconds.
Program Start[10-29-2009 14:47:10] Nagios 3.0.6 starting... (PID=29609)
Program End[10-29-2009 14:47:09] Caught SIGTERM, shutting down...
Host Down[10-29-2009 14:46:48] HOST ALERT: psplunk2;DOWN;SOFT;2;FPING CRITICAL - psplunk2. (loss=100% )
Service Critical[10-29-2009 14:46:22] SERVICE ALERT: psplunk2;/opt/splunk;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Service Critical[10-29-2009 14:46:22] SERVICE ALERT: psplunk2;/opt/splunk_index1;CRITICAL;HARD;1;CHECK_NRPE: Socket timeout after 10 seconds.
Host Down[10-29-2009 14:45:48] HOST ALERT: psplunk2;DOWN;SOFT;1;FPING CRITICAL - psplunk2. (loss=100% )
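
For anyone who wants to check their own history for the same pattern, here is a minimal Python sketch that flags every restart after which the host's attempt number failed to advance. It assumes the line layout shown in the excerpt above (timestamp in brackets, then `HOST ALERT: host;state;type;attempt;...`); the function name `find_counter_resets` and the regular expressions are my own for illustration and are not part of Nagios.

```python
import re
from datetime import datetime

# Patterns assume the history-page layout pasted above.
ALERT_RE = re.compile(
    r"\[(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] HOST ALERT: "
    r"(?P<host>[^;]+);(?P<state>[^;]+);(?P<type>SOFT|HARD);(?P<attempt>\d+);"
)
START_RE = re.compile(
    r"\[(?P<ts>\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] Nagios \S+ starting"
)


def parse_ts(text):
    return datetime.strptime(text, "%m-%d-%Y %H:%M:%S")


def find_counter_resets(lines, host):
    """Return (restart_time, attempt_before, attempt_after) for each restart
    after which the host's attempt number did not increase."""
    events = []
    for line in lines:
        m = ALERT_RE.search(line)
        if m and m.group("host") == host:
            events.append((parse_ts(m.group("ts")), "alert", int(m.group("attempt"))))
            continue
        m = START_RE.search(line)
        if m:
            events.append((parse_ts(m.group("ts")), "start", None))

    events.sort(key=lambda e: e[0])  # the history page lists newest first

    resets = []
    last_attempt = None
    pending_restart = None
    for ts, kind, attempt in events:
        if kind == "start":
            pending_restart = ts
            continue
        if (pending_restart is not None and last_attempt is not None
                and attempt <= last_attempt):
            resets.append((pending_restart, last_attempt, attempt))
        pending_restart = None
        last_attempt = attempt
    return resets


# Four lines copied from the excerpt above.
SAMPLE = [
    "Host Down[10-29-2009 14:51:17] HOST ALERT: psplunk2;DOWN;SOFT;2;FPING CRITICAL - psplunk2. (loss=100% )",
    "Program Start[10-29-2009 14:50:49] Nagios 3.0.6 starting... (PID=30236)",
    "Program End[10-29-2009 14:50:48] Caught SIGTERM, shutting down...",
    "Host Down[10-29-2009 14:50:07] HOST ALERT: psplunk2;DOWN;SOFT;4;FPING CRITICAL - psplunk2. (loss=100% )",
]

for restart, before, after in find_counter_resets(SAMPLE, "psplunk2"):
    print(f"restart at {restart}: attempt went from {before} to {after}")
```

Run against the full excerpt, this should flag each of the four restarts, since the first alert after every "Program Start" is attempt 2 regardless of where the count stood beforehand.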

The end result was that instead of getting an alert that this host was down in approximately 10 minutes, we were notified after about 30 minutes, essentially as soon as I left the server alone long enough for it to go through 10 consecutive failures.
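
To put numbers on that delay, here is a small Python sketch using three timestamps taken from the history above. The "expected" figure is only an estimate: it assumes the retry spacing seen in the last uninterrupted run (attempts 2 through 10) would have applied from the first failure onward.

```python
from datetime import datetime

FMT = "%m-%d-%Y %H:%M:%S"

first_soft = datetime.strptime("10-29-2009 14:45:48", FMT)  # DOWN;SOFT;1
run_start = datetime.strptime("10-29-2009 15:04:56", FMT)   # DOWN;SOFT;2 after the last restart
hard_down = datetime.strptime("10-29-2009 15:14:56", FMT)   # DOWN;HARD;10

# Time that actually elapsed between the first failure and the hard state.
actual_delay = hard_down - first_soft

# Average spacing of retries in the uninterrupted run:
# 8 intervals separate attempt 2 from attempt 10.
avg_retry = (hard_down - run_start) / 8

# Estimated delay without restarts: 9 intervals separate attempt 1 from attempt 10.
expected_delay = avg_retry * 9

print("actual delay:   ", actual_delay)
print("average retry:  ", avg_retry)
print("expected delay: ", expected_delay)
```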

Both the reporting server and the distributed node share the same attributes for retention and soft states:

soft_state_dependencies=0
passive_host_checks_are_soft=1
retain_state_information=1
use_retained_program_state=1
use_retained_scheduling_info=1
retained_host_attribute_mask=0
retained_service_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
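
One way to double-check that the two machines really do agree on these directives is to diff them programmatically. The sketch below is a rough illustration only: `parse_settings` and `diff_settings` are hypothetical helper names, the input is simple `key=value` text like the lines above, and directives that may legitimately repeat in a Nagios config are not handled (a later value simply overwrites an earlier one).

```python
def parse_settings(text):
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def diff_settings(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Values copied from the list above; in practice each string would be read
# from the main config file on the machine in question.
reporting_server = parse_settings("""
retain_state_information=1
use_retained_program_state=1
use_retained_scheduling_info=1
retained_host_attribute_mask=0
""")
distributed_node = parse_settings("""
retain_state_information=1
use_retained_program_state=1
use_retained_scheduling_info=1
retained_host_attribute_mask=0
""")

print(diff_settings(reporting_server, distributed_node))
```

An empty dict means the two sets of directives match; any entry shows the differing values side by side.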

I would expect the restarts to disrupt Nagios a bit, since it has to redo its start-time tasks, but with state retention enabled I would not have expected it to "start over" on the attempt count of checks that were already in progress.

What am I missing here?

Thanks

Mark


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users