Recovery not being fired off under certain circumstances

srunschke at abit.de srunschke at abit.de
Mon Nov 28 14:45:11 CET 2005


Hi,

lately I stumbled over a few discrepancies in our network monitoring:
we were getting Warnings but never received a Recovery, even though
it was pretty obvious that the service had recovered.
I was finally able to pin down the reason for it.

Sadly, I am unsure whether this has to be seen as "working as intended"
or whether it really is unexpected behaviour. Personally I'd call it
"broken as intended".

Excerpt from the config that reproduces the problem:

define service {
host_name                       RMS
use                             generic-SNMP
service_description             RZ_TEMPERATUR
servicegroups                   SMS-SERVICEGROUP
register                        1
check_command                   check_snmp!abit-management!1.3.6.1.4.1.2769.10.4.1.1.3.1!1!30!35
notification_interval           10
stalking_options                c,w,u
notification_options            c,w,u,r
}

define serviceescalation {
host_name                       RMS
service_description             RZ_TEMPERATUR
first_notification              1
last_notification               0
contact_groups                  HOST-CONTACTGROUP-SMS
escalation_period               24x7
escalation_options              c,r,u
}

As this is the temperature check of the monitoring system for our main
datacenter, I do want it to mail me on a Warning state, but I do not care
enough about warnings to want an SMS for them yet, so the contact groups
of RZ_TEMPERATUR are mail-only groups.
I escalate c,r,u into another contact group which contains the relevant
contacts with their pagers. Now if the service throws a Warning, we get
the mail. But if it recovers, we get neither mail nor SMS.

The reason is that the recovery falls into the territory of the
escalation, which then checks who received the notification for this
recovery in the first place. That check yields no information for the
escalation, so no recovery is fired off at all.
Even IF the check for that info were tweaked, it would still send the
recovery via SMS, which is not the behaviour I intend.
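
To make the sequence concrete, here is a toy model in Python of the
behaviour as described above. It is only a sketch and not Nagios source
code: the names Escalation, recipients_for and MAIL-GROUP are made up for
illustration (the mail-only group is not named in the config excerpt), and
the two rules marked "Assumption" simply restate the explanation given in
the previous paragraph.

```python
# Toy model of the behaviour described above; NOT Nagios source code.
# Escalation, recipients_for and "MAIL-GROUP" are hypothetical names.
from dataclasses import dataclass


@dataclass
class Escalation:
    contact_group: str
    options: set  # states this escalation applies to, e.g. {"c", "r", "u"}


def recipients_for(state, default_group, escalations, notified):
    """Return the contact groups that would be notified for `state`.

    state:    one of "w", "c", "u", "r"
    notified: set of groups that already received a problem notification
    """
    matching = [e.contact_group for e in escalations if state in e.options]
    # Assumption: a matching escalation replaces the service's own groups.
    groups = matching if matching else [default_group]
    if state == "r":
        # Assumption: a recovery only goes to groups that saw the problem.
        groups = [g for g in groups if g in notified]
    else:
        notified.update(groups)
    return groups


escalations = [Escalation("HOST-CONTACTGROUP-SMS", {"c", "r", "u"})]
notified = set()

warning = recipients_for("w", "MAIL-GROUP", escalations, notified)
recovery = recipients_for("r", "MAIL-GROUP", escalations, notified)

print(warning)   # ['MAIL-GROUP']
print(recovery)  # []
```

The Warning goes to the mail group because the escalation does not cover
"w". The Recovery is claimed by the escalation because it covers "r", but
the SMS group never saw the Warning, so in this model nobody is left to
notify.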

How do you guys see this particular problem?
Should Nagios be able to act in a more differentiated way on these kinds
of problems, or is it my burden to find a hacky-hack solution for this? ;)

I'm up for some insights into this matter.

regards
        sash

--------------------------------------------------
Sascha Runschke
Netzwerk Administration
IT-Services

ABIT AG
Robert-Bosch-Str. 1
40668 Meerbusch

Tel.:+49 (0) 2150.9153.226
Mobil:+49 (0) 173.5419665
mailto:SRunschke at abit.de

http://www.abit.net
http://www.abit-epos.net
---------------------------------
Sicherheitshinweis zur E-Mail Kommunikation /
  Security note regarding email communication:
http://www.abit.net/sicherheitshinweis.html

