Caching (?) problem with nagios 2.7

Thomas Schimpke schimpke.thomas at bhn-services.com
Sat Feb 10 10:52:39 CET 2007


Hello,

For a few days now I have been having trouble with my Nagios setup. The
first indication was that host notifications were not being sent out
(but that will be another thread soon). So this morning I decided to
check whether service notifications are affected as well.

I took a service that is checked frequently and changed the check
command so that it would fail, generating an error that results in a hard
state. The service definition looks like this:

# SAP Login
# ---------------------------------------------------------------------
#
define service {
  use                   sap_check
  host_name             eulep04

  service_description   SAP Logon
  check_command         check_sap!00
  servicegroups         SAP Logon ERP Prod

  max_check_attempts    2
  normal_check_interval 3
  retry_check_interval  1

  notification_interval 30
  notification_options  c,r
  contact_groups        rz
}

and the template

define service {
  name                  sap_check
  use                   generic-service
  is_volatile           0
  freshness_threshold   0
  check_freshness       0
  notification_period   24x7
  process_perf_data     0
  register              0
}

and 

define service {
  name                          generic-service
  active_checks_enabled         1
  passive_checks_enabled        1
  parallelize_checks            1
  check_period                  24x7
  obsess_over_service           0
  notifications_enabled         1
  event_handler_enabled         1
  flap_detection_enabled        1
  process_perf_data             1
  retain_status_information     1
  retain_nonstatus_information  1
  register                      0
}
 
(I typed in the two templates by hand, so any syntax errors may be due
to transcription.) This configuration has worked for a long time, as far
as I know without any problems.
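
Because the templates were retyped by hand, one cheap sanity check is to confirm that every `use` line refers to a template `name` that actually exists. The following is only a rough sketch: the sample text is made up and deliberately contains mismatched names, the function names are hypothetical, and the parser handles just the simple `define service { ... }` layout shown in this post. Nagios's own configuration verification remains the authoritative check.

```python
import re

# Hypothetical sample, NOT the real configuration: the names are
# deliberately mismatched so that the check has something to report.
SAMPLE = """
define service {
  use                   sap_check
  host_name             eulep04
  service_description   SAP Logon
}

define service {
  name      check_sap
  use       generic-service
  register  0
}

define service {
  name      generic_service
  register  0
}
"""

BLOCK_RE = re.compile(r"define\s+service\s*\{(.*?)\}", re.DOTALL)


def parse_services(text):
    """Return one dict of directive -> value per 'define service' block."""
    services = []
    for body in BLOCK_RE.findall(text):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop ';' inline comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            directives[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
        services.append(directives)
    return services


def unresolved_templates(services):
    """List every 'use' value that matches no template 'name'."""
    names = {s["name"] for s in services if "name" in s}
    return [s["use"] for s in services if "use" in s and s["use"] not in names]


if __name__ == "__main__":
    for ref in unresolved_templates(parse_services(SAMPLE)):
        print("no template named:", ref)
```

Run against the real object files instead of the sample, an empty result would suggest that the template chain itself is not the reason the checks are missing.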

What I did was change the instance number in the check_command
from 00 to 10. This check has to fail, since we have no SAP system with
instance number 10. After saving my changes I reloaded Nagios's
configuration (/etc/rc.d/init.d/nagios reload). Then I waited. Actually
I waited for a long time -- 15 minutes or so. The service stayed in
state OK. I saw that during this time Nagios did not perform *any* checks
of this service (I looked at the last check time in the service
overview). I verified that Nagios had re-read the new configuration
successfully by looking at the service check command under "view
configuration". So I forced a service check via the CGI. That helped...

[1171096494] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;eulep04;SAP Logon;1171096485
[1171096499] SERVICE ALERT: eulep04;SAP Logon;CRITICAL;SOFT;1;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.
[1171096559] SERVICE ALERT: eulep04;SAP Logon;CRITICAL;HARD;2;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.
[1171096559] SERVICE NOTIFICATION: rz_call_home;eulep04;SAP Logon;CRITICAL;service_notify_by_call;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.

So my service notification worked -- I received a call. BUT it worked
only after I forced the check.
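
The two SERVICE ALERT entries above also show that, once a check does run, the retry logic behaves as configured. A small sketch that parses the alert lines and measures the gap between the SOFT and the HARD state; it assumes `interval_length` is at its default of 60 seconds (the main configuration file is not shown in this post), and the helper names are hypothetical.

```python
import re
from datetime import datetime, timezone

# The two alert lines from the log above, with wrapped lines joined.
LOG = """\
[1171096499] SERVICE ALERT: eulep04;SAP Logon;CRITICAL;SOFT;1;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.
[1171096559] SERVICE ALERT: eulep04;SAP Logon;CRITICAL;HARD;2;SAP System on host xxx.xxx.xxx.xxx (instance 10 ) is down.
"""

ALERT_RE = re.compile(
    r"^\[(\d+)\] SERVICE ALERT: ([^;]*);([^;]*);([^;]*);([^;]*);(\d+);"
)


def parse_alerts(text):
    """Return (timestamp, host, service, state, state_type, attempt) tuples."""
    alerts = []
    for line in text.splitlines():
        m = ALERT_RE.match(line)
        if m:
            ts, host, svc, state, state_type, attempt = m.groups()
            alerts.append((int(ts), host, svc, state, state_type, int(attempt)))
    return alerts


def to_utc(ts):
    """Format a Unix timestamp from the log as a readable UTC time."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S UTC"
    )


if __name__ == "__main__":
    alerts = parse_alerts(LOG)
    for ts, host, svc, state, state_type, attempt in alerts:
        print(to_utc(ts), host, svc, state, state_type, attempt)
    # Seconds between the SOFT (attempt 1) and HARD (attempt 2) alert.
    print("soft -> hard:", alerts[1][0] - alerts[0][0], "seconds")
```

The gap is 60 seconds, which is what retry_check_interval 1 and max_check_attempts 2 would give under the assumed 60-second interval_length. So retries and notifications look fine; only the regularly scheduled checks are missing.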

So I decided to re-check: I changed the instance number back to 00 and
restarted nagios:

[1171096695] Caught SIGHUP, restarting...
[1171096695] Nagios 2.7 starting... (PID=14025)
[1171096695] LOG VERSION: 2.0
[1171096696] INITIAL HOST STATE: apps;UP;HARD;1;PING OK - Packet loss
... (many more of the initial host/service states)

Then I waited for about 20 minutes. The service was never checked!
I forced the check and then:

[1171097971] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;eulep04;SAP Logon;1171097965
[1171097976] SERVICE ALERT: eulep04;SAP Logon;OK;HARD;2;SAP System on xxx.xxx.xxx.xxx (instance 00) is up.
[1171097976] SERVICE NOTIFICATION: rz_call_home;eulep04;SAP Logon;OK;service_notify_by_call;SAP System on xxx.xxx.xxx.xxx (instance 00) is up.
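
To put a number on "never checked": the timestamps of the restart and of the second forced check give the length of the silent period, and normal_check_interval gives a rough count of the checks that should have run in it. This is only back-of-the-envelope arithmetic. It again assumes interval_length is 60 seconds, it ignores how Nagios spreads out the first checks after a restart, and the function name is hypothetical.

```python
RESTART_TS = 1171096695       # "Caught SIGHUP, restarting..."
FORCED_TS = 1171097971        # second SCHEDULE_FORCED_SVC_CHECK entry
NORMAL_CHECK_INTERVAL = 3     # from the service definition
INTERVAL_LENGTH = 60          # ASSUMPTION: default value, not shown in the post


def expected_checks(start, end, interval_units, interval_length=60):
    """Rough number of regular checks that fit between two timestamps."""
    period = interval_units * interval_length  # seconds between checks
    return (end - start) // period


if __name__ == "__main__":
    gap = FORCED_TS - RESTART_TS
    print("gap: %d seconds (about %.1f minutes)" % (gap, gap / 60))
    print(
        "regular checks expected in that time:",
        expected_checks(RESTART_TS, FORCED_TS,
                        NORMAL_CHECK_INTERVAL, INTERVAL_LENGTH),
    )
```

Under those assumptions roughly seven regular checks should have run in the 21 minutes between the restart and the forced check, and the log shows none.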

Does anyone have an idea what's going on here, or a hint on how to
resolve this issue? I'm feeling quite bad about this situation because
we have a large installation and we (like everyone on this list, I
suppose) depend upon Nagios... I'm not sure whether this was also an issue
with Nagios 2.5 -- as I wrote, I upgraded to 2.7 on Friday because I had
(and still have) problems with host notifications.

Also strange: we were running Nagios 2.3.1 for a long time on another
machine (RedHat 9, 32-bit -- so very old) without any problems. The
current problems appear on a 64-bit FC5 machine (I migrated my Nagios
installation to Nagios 2.5 there several weeks ago and did not notice
these problems -- but I may have overlooked them).

Thanks in advance for any help & ideas

Thomas


