Service checks not resuming after emergence from scheduled downtime

Andreas Ericsson ae at op5.se
Tue Jan 11 10:34:32 CET 2005


Eric Loyd wrote:
> We run Nagios v1.2 (February 02, 2004) to monitor things.  We've been 
> running for a while now with stellar success (after a few idiosyncracies 
> were dealth with).  Now, this past weekend, we ran into the following:
> 
> On Saturdays from 5pm-10pm, we do not monitor a particular service (DMS) 
> because the back-end is down for weekly maintenance.  So we have a 
> check_period defined that allows monitoring all day M-F, all day Sunday, 
> and from 00:00-17:00,22:00-24:00 on Saturday.  This works great.  But 
> here's what happened.  I've included the Nagios logs for anyone that 
> wants to take a look.
> 
> At 16:27, DMS was not running.  Nagios turns SOFT CRITICAL and fires off 
> an event handler (which, internally, does nothing becuase this is not a 
> HARD CRITICAL).  At 17:28, it does this again - now we're on SOFT 
> CRITICAL #2.  At 16:29, it does it again, but because we've gone HARD 
> CRITICAL, the EKrestart event handler actually does something and starts 
> the service.  Notification is sent out now as well.  10 minutes later 
> (the normal_check_interval) DMS is seen to be running and we go back 
> into a HARD OK.  This cycle happens again at 16:49, and the process 
> repeats up to that last check.
> 
> At this point, at 16:51 is when the service went HARD CRITICAL and the 
> event handler does the restart.  At 17:00 we go into a downtime window, 
> and the service is in a 3/3 HARD CRTICIAL state.  Note that the next 
> check would have been approximately 17:01, but since it's in a downtime, 
> it skips this check.  You'd think that it would pick it up after the 
> donwtime, which starts up at 22:00.  So the next check should be 22:01 
> (give or take).
> 
> But Nagios never does this check.  The logs show that it wasn't until 
> 23:10 that someone forced a check via the command interface.  In the 
> meantime, after the downtime window is complete, it does do 
> notifications, but it never does another check.  And, to make it 
> weirder, when the check is forced, it doesn't immediately report back a 
> status, but it's as if it restarts the normal_check_interval timer, 
> because it's 10 minutes later, at 23:21, that the service has a result 
> (HARD OK), showing that someone had gone ahead and restarted DMS outside 
> the context of Nagios.
> 
> To sum up, it looks as if there is a bug when a service enters an outage 
> window in a HARD CRITICAL state, it does not start to check the service 
> when the outage window expires.  Even when a forced check was performed, 
> all it seems to have done is restart the normal_check_interval timer.
> 
> Anyone else see anything like this?
> 

Yes. The handling of hosts/services that come out of downtime is one of 
Nagios' most common areas for bugs. Flapping start/stop is another nice 
area of infinite weirdness.

Ufortunately the 1.x branch of Nagios is no longer being actively 
maintained. Without a clean patch, I can't really see this bug being 
fixed (unless the 2.x solution applies cleanly to 1.x as well). 
Development has moved on to Nagios 2.x, which has been pretty stable 
(not counting the dreadful embedded perl support) for a couple of months 
now. I'll look in to this for the 2.x branch.

I suggest you install 2.0 on a test-system for the time being so you can 
evaluate the stability for yourself. Run ./configure 
--disable-event-broker to keep advanced and untested code to a minimum.

> 
> The timeline major events:
> 
> 16:49 - critical 1/3 SOFT
> 16:50 - critical 2/3 SOFT
> 16:51 - critical 3/3 HARD, event handler restarts DMS
> 17:00 - check_period is no longer valid (downtime)
> 22:00 - check_period is now valid (no more downtime)
> 22:00 - notification sent of service down
> 22:20 - notification sent of service down
> 22:40 - notification sent of service down
> 22:50 - notification sent of service down
> 23:00 - notification sent of service down
> 23:10 - service check forced by hand
> 23:21 - service check returns HARD OK state
> 23:21 - recovery sent via email
> 
> 
> Raw logs:
> 
> Sat Jan  8 16:27:31 2005 - SERVICE ALERT: 
> 908;DMS;CRITICAL;SOFT;1;CRITICAL: DMS is not currently running.
> Sat Jan  8 16:27:31 2005 - SERVICE EVENT HANDLER: 
> 908;DMS;CRITICAL;SOFT;1;EKrestart
> Sat Jan  8 16:28:32 2005 - SERVICE ALERT: 
> 908;DMS;CRITICAL;SOFT;2;CRITICAL: DMS is not currently running.
> Sat Jan  8 16:28:32 2005 - SERVICE EVENT HANDLER: 
> 908;DMS;CRITICAL;SOFT;2;EKrestart
> Sat Jan  8 16:29:31 2005 - SERVICE ALERT: 
> 908;DMS;CRITICAL;HARD;3;CRITICAL: DMS is not currently running.
> Sat Jan  8 16:29:31 2005 - SERVICE NOTIFICATION: 
> dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
> running.
> Sat Jan  8 16:29:32 2005 - SERVICE EVENT HANDLER: 
> 908;DMS;CRITICAL;HARD;3;EKrestart
> Sat Jan  8 16:39:31 2005 - SERVICE ALERT: 908;DMS;OK;HARD;3;OK: DMS is 
> running on PID 1214
> Sat Jan  8 16:39:31 2005 - SERVICE NOTIFICATION: 
> dmaster;908;DMS;OK;notify-by-email;OK: DMS is running on PID 1214
> Sat Jan  8 16:39:31 2005 - SERVICE EVENT HANDLER: 
> 908;DMS;OK;HARD;3;EKrestart
> Sat Jan  8 16:49:31 2005 - SERVICE ALERT: 
> 908;DMS;CRITICAL;SOFT;1;CRITICAL: DMS is not currently running.
> Sat Jan  8 16:49:31 2005 - SERVICE EVENT HANDLER: 
> 908;DMS;CRITICAL;SOFT;1;EKrestart
> Sat Jan  8 16:50:31 2005 - SERVICE ALERT: 
> 908;DMS;CRITICAL;SOFT;2;CRITICAL: DMS is not currently running.
> Sat Jan  8 16:50:31 2005 - SERVICE EVENT HANDLER: 
> 908;DMS;CRITICAL;SOFT;2;EKrestart
> Sat Jan  8 16:51:31 2005 - SERVICE ALERT: 
> 908;DMS;CRITICAL;HARD;3;CRITICAL: DMS is not currently running.
> Sat Jan  8 16:51:31 2005 - SERVICE NOTIFICATION: 
> dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
> running.
> Sat Jan  8 16:51:31 2005 - SERVICE EVENT HANDLER: 
> 908;DMS;CRITICAL;HARD;3;EKrestart
> Sat Jan  8 22:00:11 2005 - SERVICE NOTIFICATION: 
> dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not 
> currently running.
> Sat Jan  8 22:00:11 2005 - SERVICE NOTIFICATION: 
> dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
> running.
> Sat Jan  8 22:20:11 2005 - SERVICE NOTIFICATION: 
> dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not 
> currently running.
> Sat Jan  8 22:20:11 2005 - SERVICE NOTIFICATION: 
> dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
> running.
> Sat Jan  8 22:40:01 2005 - SERVICE NOTIFICATION: 
> dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not 
> currently running.
> Sat Jan  8 22:40:01 2005 - SERVICE NOTIFICATION: 
> dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
> running.
> Sat Jan  8 22:50:01 2005 - SERVICE NOTIFICATION: 
> dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not 
> currently running.
> Sat Jan  8 22:50:01 2005 - SERVICE NOTIFICATION: 
> dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
> running.
> Sat Jan  8 23:00:01 2005 - SERVICE NOTIFICATION: 
> dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
> running.
> Sat Jan  8 23:10:56 2005 - EXTERNAL COMMAND: 
> SCHEDULE_FORCED_SVC_CHECK;908;DMS;1105243853
> Sat Jan  8 23:21:01 2005 - SERVICE ALERT: 908;DMS;OK;HARD;3;OK: DMS is 
> running on PID 26970
> Sat Jan  8 23:21:01 2005 - SERVICE NOTIFICATION: 
> dmaster;908;DMS;OK;notify-by-email;OK: DMS is running on PID 26970
> Sat Jan  8 23:21:02 2005 - SERVICE EVENT HANDLER: 
> 908;DMS;OK;HARD;3;EKrestart
> 
> 
> Lastly, some keywords for people searching the mailing lists for this 
> topic:  service check checks not being performed done after scheduled 
> downtime downtime ends terminates finishes
> 
> 
> -- 
> Eric Loyd
> Drive defensively.  Buy a tank.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer


-------------------------------------------------------
The SF.Net email is sponsored by: Beat the post-holiday blues
Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek.
It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null





More information about the Users mailing list