Specifying the retention period

Rahul Amaram rahul.amaram at vizury.com
Wed Sep 10 13:20:00 CEST 2014


That indeed seems to be the problem. Using maxcount rather than period 
seems to address the issue. I have raised a ticket:
http://gforge.ingby.com/gf/project/bischeck/tracker/?action=TrackerItemEdit&tracker_item_id=259

Thanks,
Rahul.

On Wednesday 10 September 2014 04:02 PM, Anders Håål wrote:
> This looks like a bug. Could you please report it on 
> http://gforge.ingby.com/gf/project/bischeck/tracker/ in the Bugs 
> tracker. You need an account, but it is just a sign-up and you get an 
> email confirmation.
> Can you try to use maxcount for purging instead as a workaround? Just 
> calculate your maxcount based on the scheduling interval you use, for 
> example as in the sketch below.
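> Just a sketch of what that could look like, assuming one sample every 
> 2 minutes (30 values per hour) and roughly 30 days of history, as in 
> your purge configuration:
>
> <cache>
>   <purge>
>     <!-- 30 days * 24 hours * 30 samples/hour = 21600 entries -->
>     <maxcount>21600</maxcount>
>   </purge>
> </cache>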
> Anders
>
> On 09/10/2014 12:17 PM, Rahul Amaram wrote:
>> Following up on the earlier topic, I am seeing the below errors related
>> to cache purging. Any idea what might be causing this? I don't see any
>> other errors in the log related to metrics.
>>
>> 2014-09-10 12:12:00.001 ; INFO ; DefaultQuartzScheduler_Worker-5 ;
>> com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge
>> purging 180
>> 2014-09-10 12:12:00.003 ; INFO ; DefaultQuartzScheduler_Worker-5 ;
>> com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge
>> executed in 1 ms
>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ;
>> org.quartz.core.JobRunShell ; Job DailyMaintenance.CachePurge threw an
>> unhandled Exception: java.lang.NullPointerException: null
>>         at com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
>>         at com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)
>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ;
>> org.quartz.core.ErrorLogger ; Job (DailyMaintenance.CachePurge threw an
>> exception.
>> org.quartz.SchedulerException: Job threw an unhandled exception.
>>         at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
>>         at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
>> Caused by: java.lang.NullPointerException: null
>>         at com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
>>         at com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)
>>
>> Here is my cache configuration:
>>
>>      <cache>
>>        <aggregate>
>>          <method>avg</method>
>>          <useweekend>true</useweekend>
>>          <retention>
>>            <period>H</period>
>>            <offset>720</offset>
>>          </retention>
>>          <retention>
>>            <period>D</period>
>>            <offset>30</offset>
>>          </retention>
>>        </aggregate>
>>
>>        <purge>
>>          <offset>30</offset>
>>          <period>D</period>
>>        </purge>
>>      </cache>
>>
>> Regards,
>> Rahul.
>> On Monday 08 September 2014 08:39 PM, Anders Håål wrote:
>>> Great if you can make a Debian package, and I understand that you
>>> cannot commit to it. The best thing would be to integrate it into our
>>> build process, where we use ant.
>>>
>>> If the purging is based on time, it could happen that data is removed
>>> from the cache, since the logic is based on time relative to now. To
>>> avoid it you should increase the purge time before you start
>>> bischeck, for example as in the sketch below. And just a comment on
>>> your last sentence: the Redis TTL is never used :)
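>>> Just a sketch of that idea (the 10-day purge and 2 days of downtime
>>> are only assumed numbers): if the purge offset was 10 days and
>>> bischeck was down for 2 days, raise the offset to at least 12 before
>>> the restart:
>>> <cache>
>>>   <purge>
>>>    <offset>12</offset>
>>>    <period>D</period>
>>>   </purge>
>>> </cache>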
>>> Anders
>>>
>>> On 09/08/2014 02:09 PM, Rahul Amaram wrote:
>>>> I would be more than happy to give you guys a testimonial. However,
>>>> we have just taken this live and would like to see how it performs
>>>> before I write one.
>>>>
>>>> Also, if time permits, I'll try to bundle this for Debian (I'm a
>>>> Debian maintainer). I can't commit to a timeline right away though :).
>>>>
>>>> Also, just to make things explicitly clear: I understand that the
>>>> below serviceitem ttl has nothing to do with the Redis TTL. But if I
>>>> stop my bischeck server for a day or two, would any of my metrics be
>>>> lost? Or would I have to increase the Redis TTL for this?
>>>>
>>>> Regards,
>>>> Rahul.
>>>>
>>>> On Monday 08 September 2014 04:09 PM, Anders Håål wrote:
>>>>> Glad that it clarified how to configure the cache section. I will
>>>>> make a blog post on this in the meantime, until we have updated
>>>>> documentation. I agree with you that the structure of the
>>>>> configuration is a bit "heavy", so ideas and input are appreciated.
>>>>>
>>>>> Regarding Redis TTL, this is a Redis feature we do not use. The ttl
>>>>> mentioned in my mail is managed by bischeck. Redis TTL does not work
>>>>> on individual nodes in a Redis linked list.
>>>>>
>>>>> Currently the bischeck installer should work for Ubuntu,
>>>>> RedHat/CentOS and Debian. There are currently no plans to make
>>>>> distribution packages like rpm or deb. I know that op5 (www.op5.com),
>>>>> which bundles Bischeck, makes a bischeck rpm. It would be super if
>>>>> anyone would like to do this for the project.
>>>>> When it comes to packaging we have done a bit of work to create
>>>>> Docker containers, but it is still experimental.
>>>>>
>>>>> I also encourage you, if you think bischeck supports your monitoring
>>>>> effort, to write a small testimonial that we can put on the site.
>>>>> Regards
>>>>> Anders
>>>>>
>>>>> On 09/08/2014 11:30 AM, Rahul Amaram wrote:
>>>>>> Thanks Anders. This explains precisely why my data was getting
>>>>>> purged after 16 hours (30 values per hour * 16 hours = 480). It
>>>>>> would be great if you could update the documentation with this
>>>>>> info. The entire setup and configuration takes time to get a hold
>>>>>> of, and detailed documentation would be very helpful.
>>>>>>
>>>>>> Also, another quick question: right now, I believe the Redis TTL is
>>>>>> set to 2000 seconds. Does this mean that if I don't receive data
>>>>>> for a particular serviceitem (or service or host) for 2000 seconds,
>>>>>> the data related to it is lost?
>>>>>>
>>>>>> Also, any plans for bundling this with distributions such as Debian?
>>>>>>
>>>>>> Regards,
>>>>>> Rahul.
>>>>>>
>>>>>>
>>>>>> On Monday 08 September 2014 02:04 PM, Anders Håål wrote:
>>>>>>> Hi Rahul,
>>>>>>> Thanks for the question and feedback on the documentation. Great
>>>>>>> to hear that you think Bischeck is awesome. If you do not
>>>>>>> understand how it works by reading the documentation, you are
>>>>>>> probably not alone, and we should consider it a documentation bug.
>>>>>>>
>>>>>>> In 1.0.0 we introduced the concepts that you are asking about, and
>>>>>>> they are really two different, independent features.
>>>>>>>
>>>>>>> Let's start with cache purging.
>>>>>>> Collected monitoring data, metrics, are kept in the cache (Redis
>>>>>>> from 1.0.0) as linked lists. There is one linked list per service
>>>>>>> definition, like host1-service1-serviceitem1. Prior to 1.0.0 all
>>>>>>> the linked lists had the same size, which was defined with the
>>>>>>> property lastStatusCacheSize. But in 1.0.0 we made that
>>>>>>> configurable so it can be defined per service definition.
>>>>>>> To enable individual cache configurations we added a section called
>>>>>>> <cache> in the serviceitem section of bischeck.xml. Like many other
>>>>>>> configuration options in 1.0.0, the cache section can either have
>>>>>>> specific values or point to a template that can be shared.
>>>>>>> To manage the size of the cache, or to be more specific the linked
>>>>>>> list size, we defined the <purge> section. The purge section can
>>>>>>> have two different configurations. The first defines the max size
>>>>>>> of the cache linked list.
>>>>>>> <cache>
>>>>>>>   <purge>
>>>>>>>    <maxcount>1000</maxcount>
>>>>>>>   </purge>
>>>>>>> </cache>
>>>>>>>
>>>>>>> The second option is to define the “time to live” for the metrics
>>>>>>> in the cache.
>>>>>>> <cache>
>>>>>>>   <purge>
>>>>>>>    <offset>10</offset>
>>>>>>>    <period>D</period>
>>>>>>>   </purge>
>>>>>>> </cache>
>>>>>>> In the above example we set the time to live to 10 days. So any
>>>>>>> metrics older than this period will be removed. The period can have
>>>>>>> the following values:
>>>>>>> H - hours
>>>>>>> D - days
>>>>>>> W - weeks
>>>>>>> Y - years
>>>>>>>
>>>>>>> The two options are mutually exclusive. You have to choose one for
>>>>>>> each serviceitem or cache template.
>>>>>>>
>>>>>>> If no cache directive is defined for a serviceitem, the property
>>>>>>> lastStatusCacheSize will be used. Its default value is 500.
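>>>>>>>
>>>>>>> As a rough worked example (the 2-minute interval is only an
>>>>>>> assumption for illustration): with one sample every 2 minutes you
>>>>>>> get 30 values per hour, so the default of 500 values covers about
>>>>>>> 500 / 30 ≈ 16.7 hours of history.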
>>>>>>>
>>>>>>> Hopefully this explains the cache purging.
>>>>>>>
>>>>>>> The next question was related to aggregations, which have nothing
>>>>>>> to do with purging but are configured in the same <cache> section.
>>>>>>> The idea with aggregations was to create an automatic way to
>>>>>>> aggregate metrics on the level of an hour, day, week and month.
>>>>>>> The aggregation functions currently supported are average, max and
>>>>>>> min.
>>>>>>> Let's say you have a service definition of the format
>>>>>>> host1-service1-serviceitem1. When you enable an average (avg)
>>>>>>> aggregation you will automatically get the following new service
>>>>>>> definitions:
>>>>>>> host1-service1/H/avg-serviceitem1
>>>>>>> host1-service1/D/avg-serviceitem1
>>>>>>> host1-service1/W/avg-serviceitem1
>>>>>>> host1-service1/M/avg-serviceitem1
>>>>>>>
>>>>>>> The configuration you need to achieve the above average
>>>>>>> aggregations is:
>>>>>>> <cache>
>>>>>>>   <aggregate>
>>>>>>>     <method>avg</method>
>>>>>>>   </aggregate>
>>>>>>> </cache>
>>>>>>>
>>>>>>> If you would like to combine it with the above described purging,
>>>>>>> your configuration would look like:
>>>>>>> <cache>
>>>>>>>   <aggregate>
>>>>>>>     <method>avg</method>
>>>>>>>   </aggregate>
>>>>>>>
>>>>>>>   <purge>
>>>>>>>    <offset>10</offset>
>>>>>>>    <period>D</period>
>>>>>>>   </purge>
>>>>>>> </cache>
>>>>>>>
>>>>>>> The new aggregated service definitions,
>>>>>>> host1-service1/H/avg-serviceitem1, etc., will have their own cache
>>>>>>> entries and can be used in threshold configurations and virtual
>>>>>>> services like any other service definition. For example, in a
>>>>>>> threshold hours section we could define:
>>>>>>>
>>>>>>> <hours hoursID="2">
>>>>>>>   <hourinterval>
>>>>>>>     <from>09:00</from>
>>>>>>>     <to>12:00</to>
>>>>>>>     <threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
>>>>>>>   </hourinterval>
>>>>>>>   ...
>>>>>>>
>>>>>>> This would mean that we use the average value of
>>>>>>> host1-service1-serviceitem1 for the period of the last hour.
>>>>>>> Aggregations are calculated hourly, daily, weekly and monthly.
>>>>>>>
>>>>>>> By default weekend metrics are not included in the aggregation
>>>>>>> calculation. This can be enabled by setting
>>>>>>> <useweekend>true</useweekend>:
>>>>>>>
>>>>>>> <cache>
>>>>>>>   <aggregate>
>>>>>>>     <method>avg</method>
>>>>>>>     <useweekend>true</useweekend>
>>>>>>>   </aggregate>
>>>>>>>   ….
>>>>>>> </cache>
>>>>>>>
>>>>>>> This will create aggregated service definitions with the following
>>>>>>> name standard:
>>>>>>> host1-service1/H/avg/weekend-serviceitem1
>>>>>>> host1-service1/D/avg/weekend-serviceitem1
>>>>>>> host1-service1/W/avg/weekend-serviceitem1
>>>>>>> host1-service1/M/avg/weekend-serviceitem1
>>>>>>>
>>>>>>> You can also have multiple entries like:
>>>>>>> <cache>
>>>>>>>   <aggregate>
>>>>>>>     <method>avg</method>
>>>>>>>     <useweekend>true</useweekend>
>>>>>>>   </aggregate>
>>>>>>>   <aggregate>
>>>>>>>     <method>max</method>
>>>>>>>   </aggregate>
>>>>>>>   ….
>>>>>>> </cache>
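>>>>>>>
>>>>>>> As a sketch of the resulting names, combining the two naming
>>>>>>> patterns above, at the hourly level you would get something like:
>>>>>>> host1-service1/H/avg/weekend-serviceitem1
>>>>>>> host1-service1/H/max-serviceitem1
>>>>>>> and correspondingly for D, W and M.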
>>>>>>>
>>>>>>> So for how long will the aggregated values be kept in the cache?
>>>>>>> By default we save:
>>>>>>> hourly aggregations for 25 hours
>>>>>>> daily aggregations for 7 days
>>>>>>> weekly aggregations for 5 weeks
>>>>>>> monthly aggregations for 1 month
>>>>>>>
>>>>>>> These values can be overridden, but they cannot be lower than the
>>>>>>> defaults. Below is an example where we save the aggregations for
>>>>>>> 168 hours, 60 days and 53 weeks.
>>>>>>> <cache>
>>>>>>>   <aggregate>
>>>>>>>     <method>avg</method>
>>>>>>>     <useweekend>true</useweekend>
>>>>>>>     <retention>
>>>>>>>       <period>H</period>
>>>>>>>       <offset>168</offset>
>>>>>>>     </retention>
>>>>>>>     <retention>
>>>>>>>       <period>D</period>
>>>>>>>       <offset>60</offset>
>>>>>>>     </retention>
>>>>>>>     <retention>
>>>>>>>       <period>W</period>
>>>>>>>       <offset>53</offset>
>>>>>>>     </retention>
>>>>>>>   </aggregate>
>>>>>>>   ….
>>>>>>> </cache>
>>>>>>>
>>>>>>> I hope this makes it a bit less confusing. What is clear to me is
>>>>>>> that we need to improve the documentation in this area.
>>>>>>>
>>>>>>> Looking forward to your feedback.
>>>>>>> Anders
>>>>>>>
>>>>>>> On 09/08/2014 06:02 AM, Rahul Amaram wrote:
>>>>>>>> Hi,
>>>>>>>> I am trying to set up the bischeck plugin for our organization. I
>>>>>>>> have configured most of it except for the cache retention period.
>>>>>>>> Here is what I want: I want to store every value that has been
>>>>>>>> generated during the past month. The reason is that my threshold
>>>>>>>> is currently calculated as the average of the metric value during
>>>>>>>> the past 4 weeks at the same time of day.
>>>>>>>>
>>>>>>>> So, how do I define the cache template for this? If I don't
>>>>>>>> define any cache template, for how many days is the data kept?
>>>>>>>> Also, how does the aggregate function work, and what does the
>>>>>>>> purge maxitems signify?
>>>>>>>>
>>>>>>>> I've gone through the documentation but it wasn't clear. Looking
>>>>>>>> forward to a response.
>>>>>>>>
>>>>>>>> Bischeck is one awesome plugin. Keep up the great work.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Rahul.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

