Race condition in freshness checking

Andreas Ericsson ae at op5.se
Mon Sep 24 21:46:52 CEST 2007


Ton Voon wrote:
> Hi!
> 
> We found a bug in the calculation of the latency for a passive check. This has 
> highlighted a possible race condition re: freshness checking. We wanted to get 
> some ideas on what is the best approach to fix this.
> 
> Background:
> 
> We have a master/slave arrangement, with freshness checking 
> (freshness_threshold=0) of slave services on the master.
> 
> Looking in the NDO db, we realised that the latency values for passive results 
> were incorrectly calculated - sometimes latency values could be 1000x out. The 
> patch is attached. However, since using this patch, we've seen occasional race 
> conditions.
> 
> Problem:
> 
> Within checks.c:check_service_result_freshness, if a service has passed its 
> expiration_time, it is marked as is_being_freshened and a forced service check 
> is scheduled. However, if a passive result for this service is processed before 
> this forced check is run, then the service is marked as stale and the state is 
> inconsistent between master and slave.
> 
> Possible solutions:
> 
>   - If a check result is processed with is_being_freshened set for the service, 
> then remove forced check from schedule if it exists.

Sounds like a good solution, since the service will be marked as 'is_being_checked'
when the check actually runs, in which case it's pointless to update the status
as it will be overwritten by the master's own active check anyway.
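
To make sure we're talking about the same thing, here is a rough sketch of that
first option in C. It is not taken from the Nagios source: the struct
definitions, the singly linked event queue and the function names
(remove_forced_check, on_passive_result) are all made up for illustration. Only
the is_being_freshened flag comes from the description above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the real Nagios structures. */
struct fresh_svc {
    bool is_being_freshened;      /* set when a forced check was scheduled */
};

struct sched_event {
    struct fresh_svc *target;     /* service this event will check */
    bool forced;                  /* true for a forced freshness check */
    struct sched_event *next;
};

/* Unlink the first forced event queued for `s`.
 * Returns the removed event (now owned by the caller) or NULL. */
struct sched_event *remove_forced_check(struct sched_event **head,
                                        const struct fresh_svc *s)
{
    for (struct sched_event **pp = head; *pp != NULL; pp = &(*pp)->next) {
        struct sched_event *ev = *pp;
        if (ev->forced && ev->target == s) {
            *pp = ev->next;       /* splice the event out of the queue */
            ev->next = NULL;
            return ev;
        }
    }
    return NULL;
}

/* Called when a passive result for `s` is processed. If a freshness check
 * is pending, clear the flag and drop the forced check from the queue. */
struct sched_event *on_passive_result(struct sched_event **head,
                                      struct fresh_svc *s)
{
    if (!s->is_being_freshened)
        return NULL;
    s->is_being_freshened = false;
    return remove_forced_check(head, s);
}
```

The caller would free the returned event. The cost of this approach is that the
result-processing path has to know about the scheduling queue.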

>   - Change is_being_freshened to stale_time (0 if not stale). On running the 
> forced check, if stale_time is less than last_check_time (+ latency?), break out 
> of running the forced check.
> 

This I didn't quite get. You mean the passive check should alter the figure passed
in is_being_freshened? If so, what if stale_time is exactly 1? How can Nagios then
determine that it has actually received a result, rather than the value just having
been updated by the passive check-result coming in?

I'm sure you thought of it, but the simplest way should be to re-check the time
since the last check result arrived when the forced check is about to run, and
cancel the check if the result is fresh enough. That way you'll keep the change to
a single spot in the code and it'll be quite maintainable, provided a comment is
added that explains the anomaly in the code-path.
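
Something along these lines is what I have in mind. Again, this is only a sketch
and not the actual checks.c code: struct svc, result_is_fresh and
should_run_forced_check are invented names, and I'm assuming the freshness
threshold has already been resolved to a number of seconds (i.e. it is not the
"auto" value of 0 from your config).

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical, trimmed-down stand-in for the Nagios service struct. */
struct svc {
    time_t last_check;            /* time of the newest result, active or passive */
    int freshness_threshold;      /* resolved threshold in seconds (assumed > 0) */
    bool is_being_freshened;      /* set when the forced check was scheduled */
};

/* A result is fresh as long as its expiration time has not passed yet. */
bool result_is_fresh(const struct svc *s, time_t now)
{
    time_t expiration_time = s->last_check + s->freshness_threshold;
    return expiration_time >= now;
}

/* Called right before the forced check would be executed.
 * Returns false (and clears the flag) if a passive result arrived in the
 * meantime and made the service fresh again; returns true otherwise. */
bool should_run_forced_check(struct svc *s, time_t now)
{
    if (s->is_being_freshened && result_is_fresh(s, now)) {
        /* A passive result beat us to it: cancel the forced check. */
        s->is_being_freshened = false;
        return false;
    }
    return true;
}
```

With this the result-processing code stays untouched; only the place that runs
the forced check needs the extra test and a comment saying why it is there.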

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
