Recovery Notifications Not Being Sent After Escalations

Stephen Bader sbader at comcast.net
Sat Aug 8 22:36:26 CEST 2009
I'm experiencing a problem with Nagios not sending recovery alerts to  
groups of users who were notified via escalation of a problem. In the  
example listed below, if a service is critical for more than an hour,  
an escalation is established to send a page. In this case, the page  
was sent after 60 minutes of the service being in the critical state,  
but when the service recovered, a recovery page was not sent. I've  
included the relevant configuration entries below, along with a log  
from an event that occurred earlier today and did not result in a  
recovery page being sent. I am running Nagios version 3.0.6 on FreeBSD  
7.2.

Here is the service definition:

# Service definition check_local_procs
define service{
        use                             generic-service
        host_name                       NETMGT
        service_description             PROCS
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              2
        normal_check_interval           5
        retry_check_interval            1
        contact_groups                  NETWORK-TEAM
        notification_interval           5
        notification_period             24x7
        notification_options            c,r
        check_command                   check_local_procs1!175!190
        }

The contact group NETWORK-TEAM sends an e-mail to all of the members  
of our networking team.

Here is the relevant escalation for this service (and all services,  
actually):

# Send a page after 60 minutes during non work hours if a service is down
# VPN-SITES group is excluded from paging during non work hours
define serviceescalation{
        hostgroup_name          !VPN-SITES, .*
        service_description     .*
        first_notification      12
        last_notification       12
        contact_groups          NETWORK-TEAM,NETWORK-TEAM-SNPP
        escalation_period       nonworkhours
        notification_interval   5
        }

The NETWORK-TEAM-SNPP group sends alphanumeric pages to our network  
group. The intention of this escalation is to send a single page to  
the pagers alerting us to a problem. We don't want to get spammed with  
pages, and a single page is sufficient. However, with this escalation,  
when the service recovers, we are only getting notified of the  
recovery to the NETWORK-TEAM contact, and the NETWORK-TEAM-SNPP  
contact is NOT being notified of the recovery.
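
As a rough check on the timing, here is a minimal sketch of the arithmetic behind "first_notification 12". It assumes that notification number n goes out about (n - 1) x notification_interval minutes after the first notification, and that interval_length is at its default of 60 seconds (that setting is not shown here, so this is an assumption). The function name is hypothetical and only for illustration.

```python
def minutes_until_notification(n: int, interval_minutes: int) -> int:
    """Approximate minutes between notification #1 and notification #n.

    Assumes one notification per interval with no gaps (an assumption,
    not something taken from the Nagios source).
    """
    if n < 1:
        raise ValueError("notification numbers start at 1")
    return (n - 1) * interval_minutes


# Values from the service and escalation definitions above.
offset = minutes_until_notification(12, 5)
print(offset)  # 55
```

With an interval of 5, the 12th notification lands about 55 minutes after the first one, which together with the initial check and retry is roughly the 60 minutes intended. Because first_notification and last_notification are both 12, only that one notification falls inside the escalation range.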

In the log below, you can see that at 13:56 the notifications were  
escalated to our pagers (via the command notify-by-ipn). However, at  
14:13, when the service recovered, we were only notified of the  
recovery via e-mail.

[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes

[08-08-2009 14:06:14] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 344 processes
[08-08-2009 14:06:14] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 344 processes
[08-08-2009 14:06:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 344 processes

[08-08-2009 14:01:14] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 348 processes
[08-08-2009 14:01:14] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 348 processes
[08-08-2009 14:01:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 348 processes

[08-08-2009 13:56:15] SERVICE NOTIFICATION: tech3-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech2-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech1-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
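
To make the gap in the log easier to see, here is a small sketch that groups the notified contacts by state and lists who received a CRITICAL notification but no OK notification. It uses a subset of the entries above with each entry on a single line, and assumes the pager contact names that wrapped across lines read tech1-ipn, tech2-ipn and tech3-ipn. The helper name and the regular expression are my own, not anything shipped with Nagios.

```python
import re
from collections import defaultdict

# Subset of the log above, one entry per line.
LOG = """\
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 14:13:51] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;OK;notify-by-email;PROCS OK: 146 processes
[08-08-2009 13:56:15] SERVICE NOTIFICATION: tech3-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech2-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech1-ipn;NETMGT;PROCS;CRITICAL;notify-by-ipn;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech3;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech2;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
[08-08-2009 13:56:14] SERVICE NOTIFICATION: tech1;NETMGT;PROCS;CRITICAL;notify-by-email;PROCS CRITICAL: 304 processes
"""

# Fields: contact;host;service;state;command;plugin output
PATTERN = re.compile(
    r"^\[(?P<ts>[^\]]+)\] SERVICE NOTIFICATION: "
    r"(?P<contact>[^;]+);(?P<host>[^;]+);(?P<service>[^;]+);"
    r"(?P<state>[^;]+);(?P<command>[^;]+);(?P<output>.*)$"
)


def contacts_by_state(log_text):
    """Map each notification state to the set of contacts notified."""
    seen = defaultdict(set)
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            seen[match.group("state")].add(match.group("contact"))
    return seen


seen = contacts_by_state(LOG)
# Contacts that heard about the problem but never about the recovery.
missing = sorted(seen["CRITICAL"] - seen["OK"])
print(missing)  # ['tech1-ipn', 'tech2-ipn', 'tech3-ipn']
```

The three e-mail contacts appear under both states, while the three pager contacts appear only under CRITICAL, which is exactly the behavior described above.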

If you need more parts of the configuration, please let me know. I'm  
not sure why we aren't being notified of the recovery via our pagers,  
because Nagios is supposed to send a recovery notification to everyone  
who was notified of the problem. Is there something wrong with my  
escalation recovery configuration or my understanding of escalations?

Thanks in advance!

-Steve
