Service checks not resuming after emergence from scheduled downtime

Eric Loyd loyd at cyber.kodak.com
Mon Jan 10 17:32:23 CET 2005


We run Nagios v1.2 (February 02, 2004) to monitor things.  We've been 
running for a while now with stellar success (after a few idiosyncracies 
were dealth with).  Now, this past weekend, we ran into the following:

On Saturdays from 5pm-10pm, we do not monitor a particular service (DMS) 
because the back-end is down for weekly maintenance.  So we have a 
check_period defined that allows monitoring all day M-F, all day Sunday, 
and from 00:00-17:00,22:00-24:00 on Saturday.  This works great.  But 
here's what happened.  I've included the Nagios logs for anyone that wants 
to take a look.

At 16:27, DMS was not running.  Nagios turns SOFT CRITICAL and fires off an 
event handler (which, internally, does nothing becuase this is not a HARD 
CRITICAL).  At 17:28, it does this again - now we're on SOFT CRITICAL 
#2.  At 16:29, it does it again, but because we've gone HARD CRITICAL, the 
EKrestart event handler actually does something and starts the 
service.  Notification is sent out now as well.  10 minutes later (the 
normal_check_interval) DMS is seen to be running and we go back into a HARD 
OK.  This cycle happens again at 16:49, and the process repeats up to that 
last check.

At this point, at 16:51 is when the service went HARD CRITICAL and the 
event handler does the restart.  At 17:00 we go into a downtime window, and 
the service is in a 3/3 HARD CRTICIAL state.  Note that the next check 
would have been approximately 17:01, but since it's in a downtime, it skips 
this check.  You'd think that it would pick it up after the donwtime, which 
starts up at 22:00.  So the next check should be 22:01 (give or take).

But Nagios never does this check.  The logs show that it wasn't until 23:10 
that someone forced a check via the command interface.  In the meantime, 
after the downtime window is complete, it does do notifications, but it 
never does another check.  And, to make it weirder, when the check is 
forced, it doesn't immediately report back a status, but it's as if it 
restarts the normal_check_interval timer, because it's 10 minutes later, at 
23:21, that the service has a result (HARD OK), showing that someone had 
gone ahead and restarted DMS outside the context of Nagios.

To sum up, it looks as if there is a bug when a service enters an outage 
window in a HARD CRITICAL state, it does not start to check the service 
when the outage window expires.  Even when a forced check was performed, 
all it seems to have done is restart the normal_check_interval timer.

Anyone else see anything like this?


The timeline major events:

16:49 - critical 1/3 SOFT
16:50 - critical 2/3 SOFT
16:51 - critical 3/3 HARD, event handler restarts DMS
17:00 - check_period is no longer valid (downtime)
22:00 - check_period is now valid (no more downtime)
22:00 - notification sent of service down
22:20 - notification sent of service down
22:40 - notification sent of service down
22:50 - notification sent of service down
23:00 - notification sent of service down
23:10 - service check forced by hand
23:21 - service check returns HARD OK state
23:21 - recovery sent via email


Raw logs:

Sat Jan  8 16:27:31 2005 - SERVICE ALERT: 908;DMS;CRITICAL;SOFT;1;CRITICAL: 
DMS is not currently running.
Sat Jan  8 16:27:31 2005 - SERVICE EVENT HANDLER: 
908;DMS;CRITICAL;SOFT;1;EKrestart
Sat Jan  8 16:28:32 2005 - SERVICE ALERT: 908;DMS;CRITICAL;SOFT;2;CRITICAL: 
DMS is not currently running.
Sat Jan  8 16:28:32 2005 - SERVICE EVENT HANDLER: 
908;DMS;CRITICAL;SOFT;2;EKrestart
Sat Jan  8 16:29:31 2005 - SERVICE ALERT: 908;DMS;CRITICAL;HARD;3;CRITICAL: 
DMS is not currently running.
Sat Jan  8 16:29:31 2005 - SERVICE NOTIFICATION: 
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
running.
Sat Jan  8 16:29:32 2005 - SERVICE EVENT HANDLER: 
908;DMS;CRITICAL;HARD;3;EKrestart
Sat Jan  8 16:39:31 2005 - SERVICE ALERT: 908;DMS;OK;HARD;3;OK: DMS is 
running on PID 1214
Sat Jan  8 16:39:31 2005 - SERVICE NOTIFICATION: 
dmaster;908;DMS;OK;notify-by-email;OK: DMS is running on PID 1214
Sat Jan  8 16:39:31 2005 - SERVICE EVENT HANDLER: 908;DMS;OK;HARD;3;EKrestart
Sat Jan  8 16:49:31 2005 - SERVICE ALERT: 908;DMS;CRITICAL;SOFT;1;CRITICAL: 
DMS is not currently running.
Sat Jan  8 16:49:31 2005 - SERVICE EVENT HANDLER: 
908;DMS;CRITICAL;SOFT;1;EKrestart
Sat Jan  8 16:50:31 2005 - SERVICE ALERT: 908;DMS;CRITICAL;SOFT;2;CRITICAL: 
DMS is not currently running.
Sat Jan  8 16:50:31 2005 - SERVICE EVENT HANDLER: 
908;DMS;CRITICAL;SOFT;2;EKrestart
Sat Jan  8 16:51:31 2005 - SERVICE ALERT: 908;DMS;CRITICAL;HARD;3;CRITICAL: 
DMS is not currently running.
Sat Jan  8 16:51:31 2005 - SERVICE NOTIFICATION: 
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
running.
Sat Jan  8 16:51:31 2005 - SERVICE EVENT HANDLER: 
908;DMS;CRITICAL;HARD;3;EKrestart
Sat Jan  8 22:00:11 2005 - SERVICE NOTIFICATION: 
dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not 
currently running.
Sat Jan  8 22:00:11 2005 - SERVICE NOTIFICATION: 
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
running.
Sat Jan  8 22:20:11 2005 - SERVICE NOTIFICATION: 
dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not 
currently running.
Sat Jan  8 22:20:11 2005 - SERVICE NOTIFICATION: 
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
running.
Sat Jan  8 22:40:01 2005 - SERVICE NOTIFICATION: 
dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not 
currently running.
Sat Jan  8 22:40:01 2005 - SERVICE NOTIFICATION: 
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
running.
Sat Jan  8 22:50:01 2005 - SERVICE NOTIFICATION: 
dmaster-page;908;DMS;CRITICAL;notify-by-epager;CRITICAL: DMS is not 
currently running.
Sat Jan  8 22:50:01 2005 - SERVICE NOTIFICATION: 
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
running.
Sat Jan  8 23:00:01 2005 - SERVICE NOTIFICATION: 
dmaster;908;DMS;CRITICAL;notify-by-email;CRITICAL: DMS is not currently 
running.
Sat Jan  8 23:10:56 2005 - EXTERNAL COMMAND: 
SCHEDULE_FORCED_SVC_CHECK;908;DMS;1105243853
Sat Jan  8 23:21:01 2005 - SERVICE ALERT: 908;DMS;OK;HARD;3;OK: DMS is 
running on PID 26970
Sat Jan  8 23:21:01 2005 - SERVICE NOTIFICATION: 
dmaster;908;DMS;OK;notify-by-email;OK: DMS is running on PID 26970
Sat Jan  8 23:21:02 2005 - SERVICE EVENT HANDLER: 908;DMS;OK;HARD;3;EKrestart


Lastly, some keywords for people searching the mailing lists for this 
topic:  service check checks not being performed done after scheduled 
downtime downtime ends terminates finishes


--
Eric Loyd
Drive defensively.  Buy a tank. 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.monitoring-lists.org/archive/users/attachments/20050110/b258a5a2/attachment.html>


More information about the Users mailing list