Race condition in freshness checking

Ton Voon ton.voon at altinity.com
Mon Sep 24 19:56:36 CEST 2007


Hi!

We found a bug in the calculation of the latency for a passive check.  
This has highlighted a possible race condition re: freshness  
checking. We wanted to get some ideas on what is the best approach to  
fix this.

Background:

We have a master/slave arrangement, with freshness checking  
(freshness_threshold=0) of slave services on the master.

Looking in the NDO db, we realised that the latency values for
passive results were incorrectly calculated - sometimes the latency
values could be out by a factor of 1000. The patch is attached.
However, since using this patch, we've seen occasional race conditions.

Problem:

Within checks.c:check_service_result_freshness, if a service has
passed its expiration_time, it is marked as is_being_freshened and a
forced service check is scheduled. However, if a passive result for
this service is processed before the forced check is run, the forced
check still runs afterwards, so the service is marked as stale and the
state is inconsistent between master and slave.
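
To make the sequence concrete, here is a minimal model of the race. It
is only a sketch and not code from checks.c: the model_service struct,
the model_* functions and the use of state 2 to stand for "stale" are
all hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, stripped-down service record (not the real Nagios struct). */
typedef struct {
    time_t last_check;          /* time of the most recent result */
    int freshness_threshold;    /* seconds a result is considered fresh */
    bool is_being_freshened;    /* set once a forced check has been queued */
    bool forced_check_pending;  /* stands in for the scheduler queue entry */
    int current_state;          /* 0 = OK, 2 = stale/CRITICAL in this model */
} model_service;

/* Freshness pass: queue a forced check if the last result has expired. */
void model_check_freshness(model_service *svc, time_t now)
{
    assert(svc != NULL);
    if (svc->is_being_freshened)
        return;
    time_t expiration_time = svc->last_check + svc->freshness_threshold;
    if (expiration_time < now) {
        svc->is_being_freshened = true;
        svc->forced_check_pending = true;
    }
}

/* A passive result arrives from the slave before the forced check runs. */
void model_process_passive_result(model_service *svc, int state, time_t when)
{
    assert(svc != NULL);
    svc->current_state = state;
    svc->last_check = when;
    /* The queued forced check is not cancelled: this is the race window. */
}

/* The forced check finally runs and overwrites the fresh passive result. */
void model_run_forced_check(model_service *svc, time_t now)
{
    assert(svc != NULL);
    if (!svc->forced_check_pending)
        return;
    svc->forced_check_pending = false;
    svc->is_being_freshened = false;
    svc->current_state = 2;     /* service is reported as stale */
    svc->last_check = now;
}
```

Calling model_check_freshness, then model_process_passive_result with
an OK state, then model_run_forced_check leaves the model in state 2
even though the slave last reported OK.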

Possible solutions:

   - If a check result is processed with is_being_freshened set for
the service, then remove the forced check from the schedule if it
exists.
   - Change is_being_freshened to stale_time (0 if not stale). On
running the forced check, if stale_time is less than last_check_time
(+ latency?), break out of running the forced check.
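
The second option could look roughly like the sketch below. The
sketch_service struct and the sketch_* functions are hypothetical
names used for illustration only, and the open question of whether to
add latency to last_check is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical record: stale_time replaces the is_being_freshened flag. */
typedef struct {
    time_t last_check;   /* time of the most recent processed result */
    time_t stale_time;   /* when the service was found stale, 0 if not stale */
} sketch_service;

/* Called by the freshness pass when it queues a forced check. */
void sketch_mark_stale(sketch_service *svc, time_t now)
{
    assert(svc != NULL);
    svc->stale_time = now;
}

/* Called when the forced check is about to run.
 * Returns false if a result arrived after the service was marked stale,
 * in which case the forced check should be skipped. */
bool sketch_should_run_forced_check(const sketch_service *svc)
{
    assert(svc != NULL);
    if (svc->stale_time == 0)
        return false;                 /* never marked stale */
    if (svc->stale_time < svc->last_check)
        return false;                 /* a newer result already came in */
    return true;
}

/* Reset the marker once the forced check has run or been skipped. */
void sketch_clear_stale(sketch_service *svc)
{
    assert(svc != NULL);
    svc->stale_time = 0;
}
```

With a strict less-than comparison, a result that arrives in the same
second as the stale marking does not cancel the forced check, which is
one reason a latency allowance might be needed.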

Neither of these sounds particularly appealing to us. Are there other
possible solutions? Any opinions?

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon


-------------- next part --------------
A non-text attachment was scrubbed...
Name: nagios_corrected_latency_for_passive_results.patch
Type: application/octet-stream
Size: 838 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20070924/1d1579c6/attachment.obj>
-------------- next part --------------
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel

