Specifying the retention period

Rahul Amaram rahul.amaram at vizury.com
Mon Sep 8 11:30:46 CEST 2014


Thanks Anders. This explains precisely why my data was getting purged
after 16 hours (30 values per hour * 16 hours = 480). It would be great
if you could update the documentation with this info. The entire setup
and configuration takes time to get the hang of, and detailed
documentation would be very helpful.
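
If I understand the explanation correctly, keeping a full month of raw
values for my case could look roughly like the sketch below. It assumes
my check really produces 30 values per hour; the 5-week offset and the
maxcount figure are my own estimates, not values from the documentation.

```xml
<!-- Sketch only: keep roughly one month of raw metrics for one serviceitem. -->
<cache>
  <purge>
    <!-- Time-based purge: 5 weeks covers a 4-week look-back with some margin. -->
    <offset>5</offset>
    <period>W</period>
    <!-- Count-based alternative (mutually exclusive with offset/period):
         30 values/hour * 24 hours * 31 days = 22320
         <maxcount>22320</maxcount> -->
  </purge>
</cache>
```

I went with the time-based form here because the count-based one has to
be recalculated whenever the check interval changes.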

Also, another quick question. Right now, I believe the Redis TTL is set
to 2000 seconds. Does this mean that if I don't receive data for a
particular serviceitem (or service or host) for 2000 seconds, the data
related to it is lost?

Also, any plans for bundling this with distributions such as Debian?

Regards,
Rahul.


On Monday 08 September 2014 02:04 PM, Anders Håål wrote:
> Hi Rahul,
> Thanks for the question and feedback on the documentation. Great to 
> hear that you think Bischeck is awesome. If you do not understand how 
> it works by reading the documentation you are probably not alone, and 
> we should consider it a documentation bug.
>
> In 1.0.0 we introduced the concepts that you are asking about, and they
> are really two independent features.
>
> Let's start with cache purging.
> Collected monitoring data, the metrics, are kept in the cache (Redis from
> 1.0.0) as linked lists. There is one linked list per service
> definition, like host1-service1-serviceitem1. Prior to 1.0.0 all the
> linked lists had the same size, defined by the property
> lastStatusCacheSize. In 1.0.0 we made the size configurable so it
> can be defined per service definition.
> To enable individual cache configurations we added a section called
> <cache> in the serviceitem section of bischeck.xml. Like many
> other configuration options in 1.0.0, the cache section can either hold
> specific values or point to a template that can be shared.
> To manage the size of the cache, or to be more specific the linked
> list size, we defined the <purge> section. The purge section can have
> two different configurations. The first defines the max size of
> the cache linked list.
> <cache>
>   <purge>
>    <maxcount>1000</maxcount>
>   </purge>
> </cache>
>
> The second option is to define the “time to live” for the metrics in
> the cache.
> <cache>
>   <purge>
>    <offset>10</offset>
>    <period>D</period>
>   </purge>
> </cache>
> In the above example we set the time to live to 10 days, so any
> metrics older than this period will be removed. The period can have
> the following values:
> H - hours
> D - days
> W - weeks
> Y - years
>
> The two options are mutually exclusive. You have to choose one for each
> serviceitem or cache template.
>
> If no cache directive is defined for a serviceitem, the property
> lastStatusCacheSize will be used. Its default value is 500.
>
> Hopefully this explains the cache purging.
>
> The next question was related to aggregations, which have nothing to do
> with purging but are configured in the same <cache> section. The
> idea with aggregations is to provide an automatic way to aggregate
> metrics on the level of an hour, day, week and month. The aggregation
> functions currently supported are average, max and min.
> Let's say you have a service definition of the format
> host1-service1-serviceitem1. When you enable an average (avg)
> aggregation you will automatically get the following new service
> definitions:
> host1-service1/H/avg-serviceitem1
> host1-service1/D/avg-serviceitem1
> host1-service1/W/avg-serviceitem1
> host1-service1/M/avg-serviceitem1
>
> The configuration you need to achieve the above average aggregations is:
> <cache>
>   <aggregate>
>     <method>avg</method>
>   </aggregate>
> </cache>
>
> If you would like to combine it with the purging described above, your
> configuration would look like:
> <cache>
>   <aggregate>
>     <method>avg</method>
>   </aggregate>
>
>   <purge>
>    <offset>10</offset>
>    <period>D</period>
>   </purge>
> </cache>
>
> The new aggregated service definitions,
> host1-service1/H/avg-serviceitem1, etc., will have their own cache
> entries and can be used in threshold configurations and virtual
> services like any other service definition. For example, in a
> threshold hours section we could define:
>
> <hours hoursID="2">
>
>   <hourinterval>
>     <from>09:00</from>
>     <to>12:00</to>
>     <threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
>   </hourinterval>
>   ...
>
> This would mean that we use the average value for
> host1-service1-serviceitem1 for the period of the last hour.
> Aggregations are calculated hourly, daily, weekly and monthly.
>
> By default, weekend metrics are not included in the aggregation
> calculation. They can be included by setting
> <useweekend>true</useweekend>:
>
> <cache>
>   <aggregate>
>     <method>avg</method>
>     <useweekend>true</useweekend>
>   </aggregate>
>   ….
> </cache>
>
> This will create aggregated service definitions with the following
> naming convention:
> host1-service1/H/avg/weekend-serviceitem1
> host1-service1/D/avg/weekend-serviceitem1
> host1-service1/W/avg/weekend-serviceitem1
> host1-service1/M/avg/weekend-serviceitem1
>
> You can also have multiple entries like:
> <cache>
>   <aggregate>
>     <method>avg</method>
>     <useweekend>true</useweekend>
>   </aggregate>
>   <aggregate>
>     <method>max</method>
>   </aggregate>
>   ….
> </cache>
>
> So how long will the aggregated values be kept in the cache? By
> default we save:
> Hourly aggregations for 25 hours
> Daily aggregations for 7 days
> Weekly aggregations for 5 weeks
> Monthly aggregations for 1 month
>
> These values can be overridden, but they cannot be lower than the
> defaults. Below is an example where we save the aggregations for
> 168 hours, 60 days and 53 weeks.
> <cache>
>   <aggregate>
>     <method>avg</method>
>     <useweekend>true</useweekend>
>     <retention>
>       <period>H</period>
>       <offset>168</offset>
>     </retention>
>     <retention>
>       <period>D</period>
>       <offset>60</offset>
>     </retention>
>     <retention>
>       <period>W</period>
>       <offset>53</offset>
>     </retention>
>   </aggregate>
>   ….
> </cache>
>
> I hope this makes it a bit less confusing. What is clear to me is that 
> we need to improve the documentation in this area.
>
> Looking forward to your feedback.
> Anders
>
> On 09/08/2014 06:02 AM, Rahul Amaram wrote:
>> Hi,
>> I am trying to set up the bischeck plugin for our organization. I have
>> configured most of it except for the cache retention period. Here
>> is what I want - I want to store every value which has been generated
>> during the past 1 month. The reason being my threshold is currently
>> calculated as the average of the metric value during the past 4 weeks at
>> the same time of the day.
>>
>> So, how do I define the cache template for this? If I don't define any
>> cache template, for how many days is the data kept?
>> Also, how does the aggregate function work and what does the purge
>> Maxitems signify?
>>
>> I've gone through the documentation but it wasn't clear. Looking forward
>> to a response.
>>
>> Bischeck is one awesome plugin. Keep up the great work.
>>
>> Regards,
>> Rahul.
>>
>
>



More information about the Bischeck-users mailing list