Specifying the retention period

Anders Håål anders.haal at ingby.com
Wed Sep 10 13:23:50 CEST 2014


Thanks - got the ticket.
I will update progress on the bug ticket, but it's good that the 
workaround works.
Anders

On 09/10/2014 01:20 PM, Rahul Amaram wrote:
> That indeed seems to be the problem. Using count rather than period
> seems to address the issue. I raised a ticket:
> http://gforge.ingby.com/gf/project/bischeck/tracker/?action=TrackerItemEdit&tracker_item_id=259
>
> Thanks,
> Rahul.
>
> On Wednesday 10 September 2014 04:02 PM, Anders Håål wrote:
>> This looks like a bug. Could you please report it on
>> http://gforge.ingby.com/gf/project/bischeck/tracker/ in the Bugs
>> tracker? You need an account, but it's just a sign-up and you get an
>> email confirmation.
>> Can you try to use maxcount for purging instead as a workaround? Just
>> calculate your maxcount based on the scheduling interval you use.
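>>
>> For example, if you collect around 30 values per hour and want to keep
>> 30 days of data, maxcount would be 30 * 24 * 30 = 21600:
>>
>> <cache>
>>   <purge>
>>     <!-- example value: 30 values/hour * 24 hours * 30 days -->
>>     <maxcount>21600</maxcount>
>>   </purge>
>> </cache>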
>> Anders
>>
>> On 09/10/2014 12:17 PM, Rahul Amaram wrote:
>>> Following up on the earlier topic, I am seeing the errors below related
>>> to cache purging. Any idea what might be causing this? I don't see any
>>> other errors in the log related to metrics.
>>>
>>> 2014-09-10 12:12:00.001 ; INFO ; DefaultQuartzScheduler_Worker-5 ;
>>> com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge
>>> purging 180
>>> 2014-09-10 12:12:00.003 ; INFO ; DefaultQuartzScheduler_Worker-5 ;
>>> com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge
>>> executed in 1 ms
>>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ;
>>> org.quartz.core.JobRunShell ; Job DailyMaintenance.CachePurge threw an
>>> unhandled Exception: java.lang.NullPointerException: null
>>>         at com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
>>>         at com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)
>>>
>>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ;
>>> org.quartz.core.ErrorLogger ; Job (DailyMaintenance.CachePurge threw an
>>> exception.
>>> org.quartz.SchedulerException: Job threw an unhandled exception.
>>>         at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
>>>         at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
>>> Caused by: java.lang.NullPointerException: null
>>>         at com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
>>>         at com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)
>>>
>>> Here is my cache configuration:
>>>
>>>      <cache>
>>>        <aggregate>
>>>          <method>avg</method>
>>>          <useweekend>true</useweekend>
>>>          <retention>
>>>            <period>H</period>
>>>            <offset>720</offset>
>>>          </retention>
>>>          <retention>
>>>            <period>D</period>
>>>            <offset>30</offset>
>>>          </retention>
>>>        </aggregate>
>>>
>>>        <purge>
>>>          <offset>30</offset>
>>>          <period>D</period>
>>>        </purge>
>>>      </cache>
>>>
>>> Regards,
>>> Rahul.
>>> On Monday 08 September 2014 08:39 PM, Anders Håål wrote:
>>>> It would be great if you can make a Debian package, and I understand
>>>> that you cannot commit to a timeline. The best thing would be to
>>>> integrate it into our build process, where we use ant.
>>>>
>>>> If the purging is based on time, then data could be removed from the
>>>> cache while the server is stopped, since the logic is based on time
>>>> relative to now. To avoid this you should increase the purge time
>>>> before you start bischeck. And just a comment on your last sentence:
>>>> the Redis TTL is never used :)
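>>>>
>>>> For example, with a purge configuration of 30 days, if bischeck were
>>>> stopped for 2 days you could raise the offset before starting it
>>>> again:
>>>>
>>>> <purge>
>>>>   <!-- example: 30 days retention + 2 days downtime -->
>>>>   <offset>32</offset>
>>>>   <period>D</period>
>>>> </purge>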
>>>> Anders
>>>>
>>>> On 09/08/2014 02:09 PM, Rahul Amaram wrote:
>>>>> I would be more than happy to give you guys a testimonial. However,
>>>>> we have just taken this live and would like to see its performance
>>>>> before I do.
>>>>>
>>>>> Also, if time permits, I'll try to bundle this for Debian (I'm a
>>>>> Debian maintainer). I can't commit to a timeline right away though :).
>>>>>
>>>>> Also, just to make things explicitly clear: I understand that the
>>>>> serviceitem TTL below has nothing to do with the Redis TTL. But if I
>>>>> stop my bischeck server for a day or two, would any of my metrics get
>>>>> lost? Or would I have to increase the Redis TTL for this?
>>>>>
>>>>> Regards,
>>>>> Rahul.
>>>>>
>>>>> On Monday 08 September 2014 04:09 PM, Anders Håål wrote:
>>>>>> Glad that it clarified how to configure the cache section. I will
>>>>>> make a blog post on this in the meantime, until we have updated
>>>>>> documentation. I agree with you that the structure of the
>>>>>> configuration is a bit "heavy", so ideas and input are appreciated.
>>>>>>
>>>>>> Regarding the Redis TTL, this is a Redis feature we do not use. The
>>>>>> TTL mentioned in my mail is managed by bischeck. Redis TTLs do not
>>>>>> work on individual nodes in a Redis linked list.
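>>>>>> (In Redis, EXPIRE sets a time to live on a whole key, so it would
>>>>>> drop the entire list for a service definition rather than expire
>>>>>> individual entries.)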
>>>>>>
>>>>>> Currently the bischeck installer should work for Ubuntu,
>>>>>> RedHat/CentOS and Debian. There are currently no plans to make
>>>>>> distribution packages like rpm or deb. I know that op5
>>>>>> (www.op5.com), which bundles Bischeck, makes a bischeck rpm. It
>>>>>> would be super if anyone would like to do this for the project.
>>>>>> When it comes to packaging we have done a bit of work to create
>>>>>> Docker containers, but it's still experimental.
>>>>>>
>>>>>> I also encourage you, if you think bischeck supports your monitoring
>>>>>> effort, to write a small testimonial that we can put on the site.
>>>>>> Regards
>>>>>> Anders
>>>>>>
>>>>>> On 09/08/2014 11:30 AM, Rahul Amaram wrote:
>>>>>>> Thanks Anders. This explains precisely why my data was getting
>>>>>>> purged after 16 hours (30 values per hour * 16 hours = 480, close
>>>>>>> to the default cache size of 500). It would be great if you could
>>>>>>> update the documentation with this info. The entire setup and
>>>>>>> configuration itself takes time to get a handle on, and detailed
>>>>>>> documentation would be very helpful.
>>>>>>>
>>>>>>> Also, another quick question: right now, I believe the Redis TTL is
>>>>>>> set to 2000 seconds. Does this mean that if I don't receive data
>>>>>>> for a particular serviceitem (or service or host) for 2000 seconds,
>>>>>>> the data related to it is lost?
>>>>>>>
>>>>>>> Also, any plans for bundling this with distributions such as Debian?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Rahul.
>>>>>>>
>>>>>>>
>>>>>>> On Monday 08 September 2014 02:04 PM, Anders Håål wrote:
>>>>>>>> Hi Rahul,
>>>>>>>> Thanks for the question and feedback on the documentation. Great
>>>>>>>> to hear that you think Bischeck is awesome. If you do not
>>>>>>>> understand how it works by reading the documentation you are
>>>>>>>> probably not alone, and we should consider it a documentation bug.
>>>>>>>>
>>>>>>>> In 1.0.0 we introduced the concepts that you are asking about, and
>>>>>>>> they are really two different independent features.
>>>>>>>>
>>>>>>>> Let's start with cache purging.
>>>>>>>> Collected monitoring data, metrics, are kept in the cache (Redis
>>>>>>>> from 1.0.0) as linked lists. There is one linked list per service
>>>>>>>> definition, like host1-service1-serviceitem1. Prior to 1.0.0 all
>>>>>>>> the linked lists had the same size, defined with the property
>>>>>>>> lastStatusCacheSize, but in 1.0.0 we made that configurable so it
>>>>>>>> can be defined per service definition.
>>>>>>>> To enable individual cache configurations we added a section called
>>>>>>>> <cache> in the serviceitem section of bischeck.xml. Like many other
>>>>>>>> configuration options in 1.0.0, the cache section can hold the
>>>>>>>> specific values or point to a template that can be shared.
>>>>>>>> To manage the size of the cache, or to be more specific the linked
>>>>>>>> list size, we defined the <purge> section. The purge section can
>>>>>>>> have two different configurations. The first defines the max size
>>>>>>>> of the cache linked list:
>>>>>>>> <cache>
>>>>>>>>   <purge>
>>>>>>>>    <maxcount>1000</maxcount>
>>>>>>>>   </purge>
>>>>>>>> </cache>
>>>>>>>>
>>>>>>>> The second option is to define the "time to live" for the metrics
>>>>>>>> in the cache:
>>>>>>>> <cache>
>>>>>>>>   <purge>
>>>>>>>>    <offset>10</offset>
>>>>>>>>    <period>D</period>
>>>>>>>>   </purge>
>>>>>>>> </cache>
>>>>>>>> In the above example we set the time to live to 10 days, so any
>>>>>>>> metrics older than this period will be removed. The period can
>>>>>>>> have the following values:
>>>>>>>> H - hours
>>>>>>>> D - days
>>>>>>>> W - weeks
>>>>>>>> Y - years
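>>>>>>>>
>>>>>>>> For example, <offset>720</offset> together with <period>H</period>
>>>>>>>> keeps 720 hours of metrics, i.e. 30 days.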
>>>>>>>>
>>>>>>>> The two options are mutually exclusive; you have to choose one for
>>>>>>>> each serviceitem or cache template.
>>>>>>>>
>>>>>>>> If no cache directive is defined for a serviceitem, the property
>>>>>>>> lastStatusCacheSize will be used. Its default value is 500.
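>>>>>>>>
>>>>>>>> For example, at 30 collected values per hour the default of 500
>>>>>>>> entries corresponds to roughly 16 hours of data, which is why data
>>>>>>>> can appear to be purged after that time.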
>>>>>>>>
>>>>>>>> Hopefully this explains the cache purging.
>>>>>>>>
>>>>>>>> The next question was related to aggregations, which have nothing
>>>>>>>> to do with purging but are configured in the same <cache> section.
>>>>>>>> The idea with aggregations was to create an automatic way to
>>>>>>>> aggregate metrics on the level of an hour, day, week and month.
>>>>>>>> The aggregation functions currently supported are average, max and
>>>>>>>> min.
>>>>>>>> Let's say you have a service definition of the format
>>>>>>>> host1-service1-serviceitem1. When you enable an average (avg)
>>>>>>>> aggregation you will automatically get the following new service
>>>>>>>> definitions:
>>>>>>>> host1-service1/H/avg-serviceitem1
>>>>>>>> host1-service1/D/avg-serviceitem1
>>>>>>>> host1-service1/W/avg-serviceitem1
>>>>>>>> host1-service1/M/avg-serviceitem1
>>>>>>>>
>>>>>>>> The configuration you need to achieve the above average
>>>>>>>> aggregations is:
>>>>>>>> <cache>
>>>>>>>>   <aggregate>
>>>>>>>>     <method>avg</method>
>>>>>>>>   </aggregate>
>>>>>>>> </cache>
>>>>>>>>
>>>>>>>> If you would like to combine it with the purging described above,
>>>>>>>> your configuration would look like:
>>>>>>>> <cache>
>>>>>>>>   <aggregate>
>>>>>>>>     <method>avg</method>
>>>>>>>>   </aggregate>
>>>>>>>>
>>>>>>>>   <purge>
>>>>>>>>    <offset>10</offset>
>>>>>>>>    <period>D</period>
>>>>>>>>   </purge>
>>>>>>>> </cache>
>>>>>>>>
>>>>>>>> The new aggregated service definitions,
>>>>>>>> host1-service1/H/avg-serviceitem1, etc., will have their own cache
>>>>>>>> entries and can be used in threshold configurations and virtual
>>>>>>>> services like any other service definition. For example, in a
>>>>>>>> threshold hours section we could define:
>>>>>>>> <hours hoursID="2">
>>>>>>>>   <hourinterval>
>>>>>>>>     <from>09:00</from>
>>>>>>>>     <to>12:00</to>
>>>>>>>>     <threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
>>>>>>>>   </hourinterval>
>>>>>>>>   ...
>>>>>>>>
>>>>>>>> This would mean that we use the average value of
>>>>>>>> host1-service1-serviceitem1 for the period of the last hour.
>>>>>>>> Aggregations are calculated hourly, daily, weekly and monthly.
>>>>>>>>
>>>>>>>> By default, weekend metrics are not included in the aggregation
>>>>>>>> calculation. They can be included by setting
>>>>>>>> <useweekend>true</useweekend>:
>>>>>>>>
>>>>>>>> <cache>
>>>>>>>>   <aggregate>
>>>>>>>>     <method>avg</method>
>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>   </aggregate>
>>>>>>>>   ….
>>>>>>>> </cache>
>>>>>>>>
>>>>>>>> This will create aggregated service definitions with the following
>>>>>>>> naming standard:
>>>>>>>> host1-service1/H/avg/weekend-serviceitem1
>>>>>>>> host1-service1/D/avg/weekend-serviceitem1
>>>>>>>> host1-service1/W/avg/weekend-serviceitem1
>>>>>>>> host1-service1/M/avg/weekend-serviceitem1
>>>>>>>>
>>>>>>>> You can also have multiple entries like:
>>>>>>>> <cache>
>>>>>>>>   <aggregate>
>>>>>>>>     <method>avg</method>
>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>   </aggregate>
>>>>>>>>   <aggregate>
>>>>>>>>     <method>max</method>
>>>>>>>>   </aggregate>
>>>>>>>>   ….
>>>>>>>> </cache>
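>>>>>>>>
>>>>>>>> Following the naming standard above, the max aggregation in this
>>>>>>>> example should then create host1-service1/H/max-serviceitem1, and
>>>>>>>> so on for D, W and M.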
>>>>>>>>
>>>>>>>> So how long will the aggregated values be kept in the cache? By
>>>>>>>> default we save:
>>>>>>>> hour aggregations for 25 hours
>>>>>>>> daily aggregations for 7 days
>>>>>>>> weekly aggregations for 5 weeks
>>>>>>>> monthly aggregations for 1 month
>>>>>>>>
>>>>>>>> These values can be overridden, but they cannot be lower than the
>>>>>>>> defaults. Below is an example where we save the aggregations for
>>>>>>>> 168 hours, 60 days and 53 weeks:
>>>>>>>> <cache>
>>>>>>>>   <aggregate>
>>>>>>>>     <method>avg</method>
>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>     <retention>
>>>>>>>>       <period>H</period>
>>>>>>>>       <offset>168</offset>
>>>>>>>>     </retention>
>>>>>>>>     <retention>
>>>>>>>>       <period>D</period>
>>>>>>>>       <offset>60</offset>
>>>>>>>>     </retention>
>>>>>>>>     <retention>
>>>>>>>>       <period>W</period>
>>>>>>>>       <offset>53</offset>
>>>>>>>>     </retention>
>>>>>>>>   </aggregate>
>>>>>>>>   ….
>>>>>>>> </cache>
>>>>>>>>
>>>>>>>> I hope this makes it a bit less confusing. What is clear to me is
>>>>>>>> that
>>>>>>>> we need to improve the documentation in this area.
>>>>>>>>
>>>>>>>> Looking forward to your feedback.
>>>>>>>> Anders
>>>>>>>>
>>>>>>>> On 09/08/2014 06:02 AM, Rahul Amaram wrote:
>>>>>>>>> Hi,
>>>>>>>>> I am trying to set up the bischeck plugin for our organization. I
>>>>>>>>> have configured most of it except for the cache retention period.
>>>>>>>>> Here is what I want: I want to store every value that has been
>>>>>>>>> generated during the past month. The reason is that my threshold
>>>>>>>>> is currently calculated as the average of the metric value during
>>>>>>>>> the past 4 weeks at the same time of the day.
>>>>>>>>>
>>>>>>>>> So, how do I define the cache template for this? If I don't
>>>>>>>>> define any cache template, for how many days is the data kept?
>>>>>>>>> Also, how does the aggregate function work, and what does the
>>>>>>>>> purge maxcount signify?
>>>>>>>>>
>>>>>>>>> I've gone through the documentation but it wasn't clear. Looking
>>>>>>>>> forward
>>>>>>>>> to a response.
>>>>>>>>>
>>>>>>>>> Bischeck is one awesome plugin. Keep up the great work.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Rahul.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>


-- 

Ingby <http://www.ingby.com>

IngbyForge <http://gforge.ingby.com>

bischeck - dynamic and adaptive thresholds for Nagios
<http://www.bischeck.org>

anders.haal at ingby.com

Software through engineering creativity and competence

Ingenjörsbyn
Box 531
101 30 Stockholm
Sweden
www.ingby.com
Mobil: +46 70 575 35 46
Tele: +46 75 75 75 090
Fax:  +46 75 75 75 091


