BUG: Service Reaper does not reschedule Service-Checks

Percy Jahn jahn at fg-networking.de
Thu Aug 30 15:52:40 CEST 2007


Hello Ethan and others,

We are running a redundant Nagios setup that uses keepalived for the
IP transition. The problem we are seeing is that service checks get
"lost" and are never scheduled again.
I've located the problem in schedule_service_check(). On a keepalived
transition, Nagios receives either STOP_EXECUTING_SVC_CHECKS and
DISABLE_NOTIFICATIONS or, in the other direction, ENABLE_NOTIFICATIONS
and START_EXECUTING_SVC_CHECKS. If Nagios still has outstanding checks
when it receives the "disable" commands, it sets the global status
accordingly. reap_service_checks() then collects the results of the
outstanding, properly scheduled service checks and tries to reschedule
each service check via schedule_service_check(). That function exits
immediately without rescheduling, because active checks are disabled
globally. In the end the service is lost and is never rescheduled.
check_for_orphaned_services() cannot recover it either, because
reap_service_checks() has already marked the check as
"not executing/running".
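
To make the sequence easier to follow, here is a small Python model of
it. This is only a sketch: the Service and Scheduler classes, the
300-second interval and the find_orphans() helper are hypothetical
stand-ins, and the method bodies are simplified from the description
above rather than taken from the Nagios source.

```python
class Service:
    def __init__(self, name, interval=300):
        self.name = name
        self.interval = interval      # seconds between checks (assumed value)
        self.is_executing = False     # True while a result is outstanding


class Scheduler:
    def __init__(self):
        self.execute_service_checks = True   # global "active checks" switch
        self.queue = []                      # pending (check_time, name) entries

    def schedule_service_check(self, svc, check_time):
        # Behaviour as described in this report: return without queuing
        # when active checks are disabled globally.
        if not self.execute_service_checks:
            return False
        self.queue.append((check_time, svc.name))
        return True

    def start_check(self, svc):
        # The check leaves the queue and its result becomes outstanding.
        self.queue = [(t, n) for (t, n) in self.queue if n != svc.name]
        svc.is_executing = True

    def reap_service_check(self, svc, now):
        # The reaper clears the executing flag, then tries to reschedule.
        svc.is_executing = False
        return self.schedule_service_check(svc, now + svc.interval)

    def find_orphans(self, services):
        # Simplified orphan detection: only services still marked as
        # executing are candidates.
        return [s.name for s in services if s.is_executing]


sched = Scheduler()
http = Service("http")
sched.schedule_service_check(http, check_time=0)
sched.start_check(http)                    # result is now outstanding

sched.execute_service_checks = False       # STOP_EXECUTING_SVC_CHECKS arrives
rescheduled = sched.reap_service_check(http, now=10)
sched.execute_service_checks = True        # START_EXECUTING_SVC_CHECKS arrives

print(rescheduled)                  # False
print(sched.queue)                  # []
print(sched.find_orphans([http]))   # []
```

After the switch is turned back on, the queue is empty and the orphan
check has nothing to pick up, so in this model the service is never
checked again.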

My first idea for a fix is to adapt schedule_service_check() so that it
schedules all services (including the non-active ones), but I believe
this could break other things. Ethan, could you please take a closer
look at this?
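
One way such a change might look, sketched in Python: the check is
always queued, and the global switch is consulted only when the check
comes due. The function names, the tuple-based queue and the choice to
push a skipped check one interval ahead are all assumptions made for
illustration, not existing Nagios behaviour.

```python
def schedule_service_check(queue, name, check_time):
    """Always queue the check, regardless of the global switch."""
    queue.append((check_time, name))


def run_due_checks(queue, now, interval, checks_enabled):
    """Return the names of checks executed at time `now`.

    Due checks are executed only if checks are enabled; otherwise they
    stay in the queue, moved one interval into the future.
    """
    executed = []
    remaining = []
    for check_time, name in queue:
        if check_time > now:
            remaining.append((check_time, name))      # not due yet
        elif checks_enabled:
            executed.append(name)                     # due and allowed
        else:
            remaining.append((now + interval, name))  # due but disabled
    queue[:] = remaining
    return executed


queue = []
schedule_service_check(queue, "http", check_time=0)

# Checks are disabled: nothing runs, but the entry survives.
skipped = run_due_checks(queue, now=0, interval=300, checks_enabled=False)
pending = list(queue)

# Checks are enabled again: the entry is picked up on its next slot.
ran = run_due_checks(queue, now=300, interval=300, checks_enabled=True)

print(skipped)   # []
print(pending)   # [(300, 'http')]
print(ran)       # ['http']
```

The point of the sketch is only that the service never leaves the
queue while checks are disabled. Whether other code relies on disabled
services being absent from the queue is exactly the open question.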

I'm using Nagios version 2.6 and checked the Changelog, but nothing
concerning my problem is mentioned. In the meantime I have worked
around the problem in my setup by sending Nagios a SIGHUP on each
transition.
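
For reference, a minimal Python sketch of such a workaround, which a
keepalived notify script could call. The helper names are hypothetical,
and the lock file path in the usage comment is an assumption: use
whatever lock_file is set to in your nagios.cfg.

```python
import os
import signal


def read_pid(lock_file):
    """Read the daemon's PID from its lock file."""
    with open(lock_file) as fh:
        return int(fh.read().strip())


def reload_nagios(pid, kill=os.kill):
    """Send SIGHUP to the given PID.

    `kill` can be replaced for testing; by default it is os.kill.
    """
    kill(pid, signal.SIGHUP)


# Example use (path is an assumption, adjust to your installation):
#   reload_nagios(read_pid("/usr/local/nagios/var/nagios.lock"))
```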

best regards
Percy Jahn

