Possibility to avoid certain values which are way too deviant while calculating threshold

Rahul Amaram rahul.amaram at vizury.com
Thu Dec 18 04:05:51 CET 2014


I believe the MAD approach works best for us. So, here is some sample data:

25.33, 30.45, 22.43, 35.86, 30123.45, 50125.5

More often however, there would be only one outlier.
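
To make that concrete, here is a minimal Python sketch of the MAD filter applied to the sample above, followed by the mean of what remains. The function name mad_filter, the threshold of 3 and the scale constant 1.4826 (the usual value that makes MAD comparable to a standard deviation for normally distributed data) are assumptions made for illustration; this is not Bischeck code.

```python
from statistics import mean, median

def mad_filter(values, threshold=3.0, scale=1.4826):
    """Keep the values within `threshold` scaled MADs of the median.

    `threshold` and `scale` are illustrative defaults, not Bischeck settings.
    """
    med = median(values)
    # Median of the absolute deviations from the median, scaled by a constant.
    mad = scale * median([abs(v - med) for v in values])
    # If mad is 0, only values equal to the median survive.
    return [v for v in values if abs(v - med) <= threshold * mad]

sample = [25.33, 30.45, 22.43, 35.86, 30123.45, 50125.5]
kept = mad_filter(sample)
kept_mean = mean(kept)
```

On this sample the median is 33.155 and the scaled MAD is about 13.75, so both large values are dropped and the mean of the remaining four values is 28.5175.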

Thanks,
Rahul.


On Thursday 18 December 2014 02:27 AM, Anders Håål wrote:
> Sorry for the link - 
> http://stats.stackexchange.com/questions/38001/detecting-outliers-using-standard-deviations
>
>
> The problem is not writing the code; the problem is finding the logic 
> that determines which numbers to remove from the data set. What counts 
> as a deviation from the normal spread in the set?
>
> Googling a bit more I found these definitions that may be applicable 
> using stdev for your use case:
>
> *Mean and Standard Deviation Method*
>
> For this outlier detection method, the mean and standard deviation of 
> the residuals are calculated and compared. If a value is a certain 
> number of standard deviations away from the mean, that data point is 
> identified as an outlier. The specified number of standard deviations 
> is called the threshold. The default value is 3.
>
> This method can fail to detect outliers because the outliers increase 
> the standard deviation. The more extreme the outlier, the more the 
> standard deviation is affected.
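
A small Python sketch of this failure mode, using the six sample values from the top of this message. The function name stdev_outliers and the choice of the sample standard deviation are assumptions made for illustration.

```python
from statistics import mean, stdev

def stdev_outliers(values, threshold=3.0):
    """Return the values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

sample = [25.33, 30.45, 22.43, 35.86, 30123.45, 50125.5]
flagged = stdev_outliers(sample)  # empty list for this sample
```

Nothing is flagged here: the two large values pull the mean to roughly 13394 and the standard deviation to roughly 21650, so even 50125.5 is less than two standard deviations from the mean. With only six values, no point can be more than (6 - 1) / sqrt(6), about 2.04, sample standard deviations from the mean, so a threshold of 3 can never flag anything.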
>
> *Median and Median Absolute Deviation Method (MAD)*
>
> For this outlier detection method, the median of the residuals is 
> calculated. Then, the difference is calculated between each historical 
> value and this median. These differences are expressed as their 
> absolute values, and a new median is calculated and multiplied by an 
> empirically derived constant to yield the median absolute deviation 
> (MAD). If a value is a certain number of MAD away from the median of 
> the residuals, that value is classified as an outlier. The default 
> threshold is 3 MAD.
>
> This method is generally more effective than the mean and standard 
> deviation method for detecting outliers, but it can be too aggressive 
> in classifying values that are not really extremely different. Also, 
> if more than 50% of the data points have the same value, MAD is 
> computed to be 0, so any value different from the residual median is 
> classified as an outlier.
>
> *Median and Interquartile Deviation Method (IQD)*
>
> For this outlier detection method, the median of the residuals is 
> calculated, along with the 25th percentile and the 75th percentile. 
> The difference between the 25th and 75th percentile is the 
> interquartile deviation (IQD). Then, the difference is calculated 
> between each historical value and the residual median. If the 
> historical value is a certain number of IQDs away from the median of 
> the residuals, that value is classified as an outlier. The default 
> threshold is 2.22, which is equivalent to 3 standard deviations or MADs.
>
> This method is somewhat susceptible to influence from extreme 
> outliers, but less so than the mean and standard deviation method. Box 
> plots are based on this approach. The median and interquartile 
> deviation method can be used for both symmetric and asymmetric data.
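
A hedged Python sketch of the IQD method, again on the sample values from the top of this message. The function name iqd_outliers is made up for illustration, and the quartile convention (statistics.quantiles with method="inclusive") is an assumption, since the description above does not say how the percentiles are interpolated.

```python
from statistics import median, quantiles

def iqd_outliers(values, threshold=2.22):
    """Return the values more than `threshold` IQDs away from the median."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqd = q3 - q1  # 75th percentile minus 25th percentile
    med = median(values)
    return [v for v in values if abs(v - med) > threshold * iqd]

sample = [25.33, 30.45, 22.43, 35.86, 30123.45, 50125.5]
flagged_one = iqd_outliers(sample[:5])  # only one extreme value present
flagged_two = iqd_outliers(sample)      # both extreme values present
```

With one outlier (the first five values), 30123.45 is flagged. With the full sample, where two of the six values are extreme, the 75th percentile itself falls between 35.86 and 30123.45, the IQD grows to roughly 22575, and nothing is flagged, which illustrates the susceptibility to extreme outliers described above.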
>
> If you find a method that you think could work, we could implement it 
> together and you can verify it with your data. Can you say anything 
> about the data collected?
> Anders
>
> On 12/17/2014 09:25 PM, Rahul Amaram wrote:
>> Hi Anders,
>>
>> So, I would like to remove the outlier and calculate the mean for the 
>> remaining elements. Any suggestion apart from writing my own custom 
>> math function? Also, I don't think that you have shared the link.
>>
>> Thanks,
>> Rahul.
>>
>> On Thursday 18 December 2014 12:55 AM, Anders Håål wrote:
>>> Hi Rahul,
>>> It's possible, but the question is which algorithm to use. The second 
>>> question is what you would do with the remaining set: 
>>> calculate a mean?
>>> Excluding a deviant value sounds close to handling what is 
>>> called an outlier, http://en.wikipedia.org/wiki/Outlier. There are a 
>>> number of mathematical solutions to this problem, but I am not sure which 
>>> would be applicable or correct. Check this link for a discussion on 
>>> the topic where one approach is using standard deviation, but from 
>>> the discussion it does not sound like a statistically correct approach.
>>>
>>> If you or anyone else on this list finds a good approach, I am more 
>>> than happy to try it. In Bischeck it's possible to plug in your own 
>>> functions as described in 
>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-6.2 
>>> so you can easily do your own testing. Using the cache browser CLI 
>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-4.4 
>>> you can easily test your function.
>>>
>>> Anders
>>>
>>>
>>> On 12/17/2014 03:40 PM, Rahul Amaram wrote:
>>>> Hi,
>>>>
>>>> I had a quick question. Let us say we calculate the threshold based 
>>>> on the values of the past six days, one value per day. Now let us 
>>>> say, out of 6 values, one of these values is way too deviant. Then 
>>>> is it possible to exclude this deviant value from calculating the 
>>>> threshold?
>>>>
>>>> Thanks,
>>>> Rahul.
