Specifying the retention period

Rahul Amaram rahul.amaram at vizury.com
Wed Sep 10 12:17:23 CEST 2014


Following up on the earlier topic, I am seeing the below errors related 
to the cache purge. Any idea what might be causing this? I don't see any 
other errors in the log related to metrics.

2014-09-10 12:12:00.001 ; INFO ; DefaultQuartzScheduler_Worker-5 ; 
com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge 
purging 180
2014-09-10 12:12:00.003 ; INFO ; DefaultQuartzScheduler_Worker-5 ; 
com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge 
executed in 1 ms
2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ; 
org.quartz.core.JobRunShell ; Job DailyMaintenance.CachePurge threw an 
unhandled Exception: java.lang.NullPointerException: null
         at 
com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
         at 
com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)

2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ; 
org.quartz.core.ErrorLogger ; Job (DailyMaintenance.CachePurge threw an 
exception.org.quartz.SchedulerException: Job threw an unhandled exception.
         at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
         at 
org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
Caused by: java.lang.NullPointerException: null
         at 
com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
         at 
com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)

Here is my cache configuration:

     <cache>
       <aggregate>
         <method>avg</method>
         <useweekend>true</useweekend>
         <retention>
           <period>H</period>
           <offset>720</offset>
         </retention>
         <retention>
           <period>D</period>
           <offset>30</offset>
         </retention>
       </aggregate>

       <purge>
         <offset>30</offset>
         <period>D</period>
       </purge>
     </cache>
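
In case it helps isolate this, I can also try the size-based purge 
variant from your earlier mail instead of the time-based one. Just a 
sketch (the maxcount value is a guess for our data volume):

     <cache>
       <aggregate>
         <method>avg</method>
         <useweekend>true</useweekend>
       </aggregate>

       <purge>
         <maxcount>1000</maxcount>
       </purge>
     </cache>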

Regards,
Rahul.
On Monday 08 September 2014 08:39 PM, Anders Håål wrote:
> Great if you can make a Debian package, and I understand that you 
> cannot commit to a timeline. The best thing would be to integrate it 
> into our build process, where we use ant.
>
> If the purging is based on time, then after a stop it could happen 
> that data is removed from the cache, since the logic is based on time 
> relative to now. To avoid this, you should increase the purge time 
> before you start Bischeck. And just a comment on your last sentence: 
> the Redis TTL is never used :)
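>
> For example, if Bischeck will be down for two days and your purge is 
> 30 days, a sketch of a safer setting before the stop (the exact 
> margin is up to you) would be:
>
>   <purge>
>     <offset>32</offset>
>     <period>D</period>
>   </purge>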
> Anders
>
> On 09/08/2014 02:09 PM, Rahul Amaram wrote:
>> I would be more than happy to give you guys a testimonial. However, we
>> have just taken this live and would like to see how it performs before
>> I write one.
>>
>> Also, if time permits, I'll try to bundle this for Debian (I'm a Debian
>> maintainer). I can't commit to a timeline right away though :).
>>
>> Also, just to make things explicitly clear: I understand that the below
>> serviceitem TTL has nothing to do with the Redis TTL. But if I stop my
>> Bischeck server for a day or two, would any of my metrics get lost? Or
>> would I have to increase the Redis TTL for this?
>>
>> Regards,
>> Rahul.
>>
>> On Monday 08 September 2014 04:09 PM, Anders Håål wrote:
>>> Glad that it clarified how to configure the cache section. I will make
>>> a blog post on this in the meantime, until we have updated
>>> documentation. I agree with you that the structure of the
>>> configuration is a bit "heavy", so ideas and input are appreciated.
>>>
>>> Regarding the Redis TTL: this is a Redis feature we do not use. The
>>> TTL mentioned in my mail is managed by Bischeck. A Redis TTL does not
>>> work on individual nodes in a Redis linked list.
>>>
>>> Currently the Bischeck installer should work for Ubuntu, RedHat/CentOS
>>> and Debian. There are currently no plans to make distribution packages
>>> like rpm or deb. I know op5 (www.op5.com), which bundles Bischeck,
>>> makes a Bischeck rpm. It would be super if anyone would like to do
>>> this for the project.
>>> When it comes to packaging we have done a bit of work to create Docker
>>> containers, but it's still experimental.
>>>
>>> I also encourage you, if you think Bischeck supports your monitoring
>>> effort, to write a small testimonial that we can put on the site.
>>> Regards
>>> Anders
>>>
>>> On 09/08/2014 11:30 AM, Rahul Amaram wrote:
>>>> Thanks Anders. This explains precisely why my data was getting purged
>>>> after 16 hours (30 values per hour * 16 hours = 480). It would be
>>>> great if you could update the documentation with this info. The entire
>>>> setup and configuration takes time to get a hold of, and detailed
>>>> documentation would be very helpful.
>>>>
>>>> Also, another quick question: right now, I believe the Redis TTL is
>>>> set to 2000 seconds. Does this mean that if I don't receive data for a
>>>> particular serviceitem (or service or host) for 2000 seconds, the data
>>>> related to it is lost?
>>>>
>>>> Also, any plans for bundling this with distributions such as Debian?
>>>>
>>>> Regards,
>>>> Rahul.
>>>>
>>>>
>>>> On Monday 08 September 2014 02:04 PM, Anders Håål wrote:
>>>>> Hi Rahul,
>>>>> Thanks for the question and the feedback on the documentation. Great
>>>>> to hear that you think Bischeck is awesome. If you do not understand
>>>>> how it works from reading the documentation, you are probably not
>>>>> alone, and we should consider that a documentation bug.
>>>>>
>>>>> In 1.0.0 we introduced the concepts that you are asking about, and
>>>>> they are really two different, independent features.
>>>>>
>>>>> Let's start with cache purging.
>>>>> Collected monitoring data, metrics, are kept in the cache (Redis
>>>>> from 1.0.0) as linked lists. There is one linked list per service
>>>>> definition, like host1-service1-serviceitem1. Prior to 1.0.0 all the
>>>>> linked lists had the same size, which was defined with the property
>>>>> lastStatusCacheSize. In 1.0.0 we made this configurable, so it can
>>>>> be defined per service definition.
>>>>> To enable individual cache configurations we added a section called
>>>>> <cache> in the serviceitem section of bischeck.xml. Like many other
>>>>> configuration options in 1.0.0, the cache section can have specific
>>>>> values or point to a template that can be shared.
>>>>> To manage the size of the cache, or more specifically the linked
>>>>> list size, we defined the <purge> section. The purge section can
>>>>> have two different configurations. The first defines the max size
>>>>> of the cache linked list:
>>>>> <cache>
>>>>>   <purge>
>>>>>     <maxcount>1000</maxcount>
>>>>>   </purge>
>>>>> </cache>
>>>>>
>>>>> The second option is to define the “time to live” for the metrics in
>>>>> the cache:
>>>>> <cache>
>>>>>   <purge>
>>>>>     <offset>10</offset>
>>>>>     <period>D</period>
>>>>>   </purge>
>>>>> </cache>
>>>>> In the above example we set the time to live to 10 days, so any
>>>>> metrics older than this period will be removed. The period can have
>>>>> the following values:
>>>>> H - hours
>>>>> D - days
>>>>> W - weeks
>>>>> Y - years
>>>>>
>>>>> The two options are mutually exclusive. You have to choose one for
>>>>> each serviceitem or cache template.
>>>>>
>>>>> If no cache directive is defined for a serviceitem, the property
>>>>> lastStatusCacheSize will be used. Its default value is 500.
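>>>>>
>>>>> A minimal sketch of overriding that default, assuming a key/value
>>>>> property layout (the exact file and syntax depend on your install,
>>>>> so check the properties file shipped with Bischeck):
>>>>>
>>>>> <!-- hypothetical layout; verify against your properties file -->
>>>>> <property>
>>>>>   <key>lastStatusCacheSize</key>
>>>>>   <value>1000</value>
>>>>> </property>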
>>>>>
>>>>> Hopefully this explains the cache purging.
>>>>>
>>>>> The next question was related to aggregations, which have nothing to
>>>>> do with purging but are configured in the same <cache> section. The
>>>>> idea with aggregations is to provide an automatic way to aggregate
>>>>> metrics at the level of an hour, day, week and month. The aggregation
>>>>> functions currently supported are average, max and min.
>>>>> Let's say you have a service definition of the format
>>>>> host1-service1-serviceitem1. When you enable an average (avg)
>>>>> aggregation you will automatically get the following new service
>>>>> definitions:
>>>>> host1-service1/H/avg-serviceitem1
>>>>> host1-service1/D/avg-serviceitem1
>>>>> host1-service1/W/avg-serviceitem1
>>>>> host1-service1/M/avg-serviceitem1
>>>>>
>>>>> The configuration you need to achieve the above average
>>>>> aggregations is:
>>>>> <cache>
>>>>>   <aggregate>
>>>>>     <method>avg</method>
>>>>>   </aggregate>
>>>>> </cache>
>>>>>
>>>>> If you would like to combine it with the purging described above,
>>>>> your configuration would look like:
>>>>> <cache>
>>>>>   <aggregate>
>>>>>     <method>avg</method>
>>>>>   </aggregate>
>>>>>
>>>>>   <purge>
>>>>>     <offset>10</offset>
>>>>>     <period>D</period>
>>>>>   </purge>
>>>>> </cache>
>>>>>
>>>>> The new aggregated service definitions,
>>>>> host1-service1/H/avg-serviceitem1, etc., will have their own cache
>>>>> entries and can be used in threshold configurations and virtual
>>>>> services like any other service definition. For example, in a
>>>>> threshold hours section we could define:
>>>>>
>>>>> <hours hoursID="2">
>>>>>   <hourinterval>
>>>>>     <from>09:00</from>
>>>>>     <to>12:00</to>
>>>>>     <threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
>>>>>   </hourinterval>
>>>>>   ...
>>>>> </hours>
>>>>>
>>>>> This would mean that we use the average value of
>>>>> host1-service1-serviceitem1 for the period of the last hour.
>>>>> Aggregations are calculated hourly, daily, weekly and monthly.
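>>>>>
>>>>> In the same way, a sketch using the daily average (assuming the same
>>>>> index semantics as in the example above, where [0] is the value for
>>>>> the most recent period) could be:
>>>>>
>>>>> <threshold>host1-service1/D/avg-serviceitem1[0]*0.9</threshold>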
>>>>>
>>>>> By default, weekend metrics are not included in the aggregation
>>>>> calculation. They can be included by setting
>>>>> <useweekend>true</useweekend>:
>>>>>
>>>>> <cache>
>>>>>   <aggregate>
>>>>>     <method>avg</method>
>>>>>     <useweekend>true</useweekend>
>>>>>   </aggregate>
>>>>>   ...
>>>>> </cache>
>>>>>
>>>>> This will create aggregated service definitions with the following
>>>>> naming standard:
>>>>> host1-service1/H/avg/weekend-serviceitem1
>>>>> host1-service1/D/avg/weekend-serviceitem1
>>>>> host1-service1/W/avg/weekend-serviceitem1
>>>>> host1-service1/M/avg/weekend-serviceitem1
>>>>>
>>>>> You can also have multiple aggregate entries, like:
>>>>> <cache>
>>>>>   <aggregate>
>>>>>     <method>avg</method>
>>>>>     <useweekend>true</useweekend>
>>>>>   </aggregate>
>>>>>   <aggregate>
>>>>>     <method>max</method>
>>>>>   </aggregate>
>>>>>   ...
>>>>> </cache>
>>>>>
>>>>> So how long will the aggregated values be kept in the cache? By
>>>>> default we save:
>>>>> hourly aggregations for 25 hours
>>>>> daily aggregations for 7 days
>>>>> weekly aggregations for 5 weeks
>>>>> monthly aggregations for 1 month
>>>>>
>>>>> These values can be overridden, but they cannot be lower than the
>>>>> defaults. Below is an example where we save the aggregations for
>>>>> 168 hours, 60 days and 53 weeks:
>>>>> <cache>
>>>>>   <aggregate>
>>>>>     <method>avg</method>
>>>>>     <useweekend>true</useweekend>
>>>>>     <retention>
>>>>>       <period>H</period>
>>>>>       <offset>168</offset>
>>>>>     </retention>
>>>>>     <retention>
>>>>>       <period>D</period>
>>>>>       <offset>60</offset>
>>>>>     </retention>
>>>>>     <retention>
>>>>>       <period>W</period>
>>>>>       <offset>53</offset>
>>>>>     </retention>
>>>>>   </aggregate>
>>>>>   ...
>>>>> </cache>
>>>>>
>>>>> I hope this makes it a bit less confusing. What is clear to me is
>>>>> that we need to improve the documentation in this area.
>>>>>
>>>>> Looking forward to your feedback.
>>>>> Anders
>>>>>
>>>>> On 09/08/2014 06:02 AM, Rahul Amaram wrote:
>>>>>> Hi,
>>>>>> I am trying to set up the Bischeck plugin for our organization. I
>>>>>> have configured most of it except for the cache retention period.
>>>>>> Here is what I want: to store every value generated during the past
>>>>>> month. The reason is that my threshold is currently calculated as
>>>>>> the average of the metric value during the past 4 weeks at the same
>>>>>> time of day.
>>>>>>
>>>>>> So, how do I define the cache template for this? If I don't define
>>>>>> any cache template, for how many days is the data kept? Also, how
>>>>>> does the aggregate function work, and what does the purge maxcount
>>>>>> signify?
>>>>>>
>>>>>> I've gone through the documentation but it wasn't clear. Looking
>>>>>> forward to a response.
>>>>>>
>>>>>> Bischeck is one awesome plugin. Keep up the great work.
>>>>>>
>>>>>> Regards,
>>>>>> Rahul.
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

