Specifying the retention period

Rahul Amaram rahul.amaram at vizury.com
Thu Sep 11 07:09:25 CEST 2014


Ok. So would $$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[24]
refer to the average of all the values ONLY in the 24th hour before
the current time?
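
Just to check that I am reading the notation right (please correct me if
this is wrong), my current understanding of the two forms is:

$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[24]     <- index: the hourly aggregate 24 entries back ([0] being the last hour)
$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-24H]         <- time: the cached value from about 24 hours ago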

On Thursday 11 September 2014 10:30 AM, Anders Håål wrote:
> Hi Amaram,
> I think you just need to remove the minus sign when using the 
> aggregated values. A minus sign with a time indicator means back in time, 
> while a plain integer without a minus sign and time indicator is an index. Check out 
> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_configuration_guide.html#toc-Chapter-4. 
>
> You can also use redis-cli to explore the data in the cache. The key 
> in Redis is the same as the service definition.
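>
> For example, something along these lines (a rough sketch, assuming the
> default local Redis and a service definition named
> host1-service1-serviceitem1; the cache entries are Redis lists keyed by
> the service definition name):
>
> redis-cli KEYS 'host1-service1*'                          # list the cache keys, including the aggregated /H/avg ones
> redis-cli LRANGE 'host1-service1/H/avg-serviceitem1' 0 5  # show the first few entries in that aggregated list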
> Anders
>
> On 09/11/2014 06:38 AM, Rahul Amaram wrote:
>> Ok. I am facing another issue. I have been running bischeck with the 
>> aggregate function for more than a day. I am using the below 
>> threshold function.
>>
>> <threshold>avg($$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-24],$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-168],$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-336])</threshold> 
>>
>>
>> and it doesn't seem to work. I am expecting that the first aggregate 
>> value should be available.
>>
>> Instead, if I use the below threshold function (I know this is not 
>> related to aggregates)
>>
>> avg($$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-24H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-168H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-336H]) 
>>
>>
>> the threshold is calculated fine, which is just the first value, as the 
>> remaining two values are not in the cache.
>>
>> How can I debug why aggregate is not working?
>>
>> Thanks,
>> Rahul.
>>
>> On Wednesday 10 September 2014 04:53 PM, Anders Håål wrote:
>>> Thanks - got the ticket.
>>> I will update progress on the bug ticket, but it's good that the 
>>> workaround works.
>>> Anders
>>>
>>> On 09/10/2014 01:20 PM, Rahul Amaram wrote:
>>>> That indeed seems to be the problem. Using count rather than period
>>>> seems to address the issue. Raised a ticket -
>>>> http://gforge.ingby.com/gf/project/bischeck/tracker/?action=TrackerItemEdit&tracker_item_id=259
>>>>
>>>> Thanks,
>>>> Rahul.
>>>>
>>>> On Wednesday 10 September 2014 04:02 PM, Anders Håål wrote:
>>>>> This looks like a bug. Could you please report it on
>>>>> http://gforge.ingby.com/gf/project/bischeck/tracker/ in the Bugs
>>>>> tracker? You need an account, but it's just a sign-up and you get an
>>>>> email confirmation.
>>>>> Can you try to use maxcount for purging instead as a workaround?
>>>>> Just calculate your maxcount based on the scheduling interval you use.
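>>>>>
>>>>> For example (just a sketch, assuming a 5-minute scheduling interval and
>>>>> that you want to keep roughly 30 days of data, i.e. 30 * 24 * 12 = 8640
>>>>> entries):
>>>>>
>>>>> <cache>
>>>>>   <purge>
>>>>>     <maxcount>8640</maxcount>
>>>>>   </purge>
>>>>> </cache>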
>>>>> Anders
>>>>>
>>>>> On 09/10/2014 12:17 PM, Rahul Amaram wrote:
>>>>>> Following up on the earlier topic, I am seeing the below errors
>>>>>> related to cache purging. Any idea what might be causing this? I don't
>>>>>> see any other errors in the log related to metrics.
>>>>>>
>>>>>> 2014-09-10 12:12:00.001 ; INFO ; DefaultQuartzScheduler_Worker-5 ;
>>>>>> com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge purging 180
>>>>>> 2014-09-10 12:12:00.003 ; INFO ; DefaultQuartzScheduler_Worker-5 ;
>>>>>> com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge executed in 1 ms
>>>>>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ;
>>>>>> org.quartz.core.JobRunShell ; Job DailyMaintenance.CachePurge threw an
>>>>>> unhandled Exception: java.lang.NullPointerException: null
>>>>>>          at com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
>>>>>>          at com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)
>>>>>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ;
>>>>>> org.quartz.core.ErrorLogger ; Job (DailyMaintenance.CachePurge threw an
>>>>>> exception.
>>>>>> org.quartz.SchedulerException: Job threw an unhandled exception.
>>>>>>          at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
>>>>>>          at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
>>>>>> Caused by: java.lang.NullPointerException: null
>>>>>>          at com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
>>>>>>          at com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)
>>>>>>
>>>>>> Here is my cache configuration:
>>>>>>
>>>>>>      <cache>
>>>>>>        <aggregate>
>>>>>>          <method>avg</method>
>>>>>>          <useweekend>true</useweekend>
>>>>>>          <retention>
>>>>>>            <period>H</period>
>>>>>>            <offset>720</offset>
>>>>>>          </retention>
>>>>>>          <retention>
>>>>>>            <period>D</period>
>>>>>>            <offset>30</offset>
>>>>>>          </retention>
>>>>>>        </aggregate>
>>>>>>
>>>>>>        <purge>
>>>>>>          <offset>30</offset>
>>>>>>          <period>D</period>
>>>>>>        </purge>
>>>>>>      </cache>
>>>>>>
>>>>>> Regards,
>>>>>> Rahul.
>>>>>> On Monday 08 September 2014 08:39 PM, Anders Håål wrote:
>>>>>>> Great if you can make a Debian package, and I understand that you
>>>>>>> cannot commit to a timeline. The best thing would be to integrate it
>>>>>>> into our build process, where we use ant.
>>>>>>>
>>>>>>> If the purging is based on time, then it could happen that data is
>>>>>>> removed from the cache, since the logic is based on time relative to
>>>>>>> now. To avoid that, you should increase the purge time before you
>>>>>>> start bischeck. And just a comment on your last sentence: the Redis
>>>>>>> TTL is never used :)
>>>>>>> Anders
>>>>>>>
>>>>>>> On 09/08/2014 02:09 PM, Rahul Amaram wrote:
>>>>>>>> I would be more than happy to give you guys a testimonial. However,
>>>>>>>> we have just taken this live and would like to see its performance
>>>>>>>> before I give a testimonial.
>>>>>>>>
>>>>>>>> Also, if time permits, I'll try to bundle this for Debian (I'm a
>>>>>>>> Debian maintainer). I can't commit to a timeline right away though :).
>>>>>>>>
>>>>>>>> Also, just to make things explicitly clear: I understand that the
>>>>>>>> below serviceitem TTL has nothing to do with the Redis TTL. But if I
>>>>>>>> stop my bischeck server for a day or two, would any of my metrics get
>>>>>>>> lost? Or would I have to increase the Redis TTL for this?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Rahul.
>>>>>>>>
>>>>>>>> On Monday 08 September 2014 04:09 PM, Anders Håål wrote:
>>>>>>>>> Glad that it clarified how to configure the cache section. I will
>>>>>>>>> make a blog post on this in the meantime, until we have updated
>>>>>>>>> documentation. I agree with you that the structure of the
>>>>>>>>> configuration is a bit "heavy", so ideas and input are appreciated.
>>>>>>>>>
>>>>>>>>> Regarding the Redis TTL, this is a Redis feature we do not use. The
>>>>>>>>> TTL mentioned in my mail is managed by bischeck. A Redis TTL on a
>>>>>>>>> linked list does not work on individual nodes in the list.
>>>>>>>>>
>>>>>>>>> Currently the bischeck installer should work for Ubuntu,
>>>>>>>>> RedHat/CentOS and Debian. There are currently no plans to make
>>>>>>>>> distribution packages like rpm or deb. I know op5 (www.op5.com),
>>>>>>>>> which bundles Bischeck, makes a bischeck rpm. It would be super if
>>>>>>>>> anyone would like to do this for the project.
>>>>>>>>> When it comes to packaging we have done a bit of work to create
>>>>>>>>> docker containers, but it's still experimental.
>>>>>>>>>
>>>>>>>>> I also encourage you, if you think bischeck supports your monitoring
>>>>>>>>> effort, to write a small testimonial that we can put on the site.
>>>>>>>>> Regards
>>>>>>>>> Anders
>>>>>>>>>
>>>>>>>>> On 09/08/2014 11:30 AM, Rahul Amaram wrote:
>>>>>>>>>> Thanks Anders. This explains precisely why my data was getting
>>>>>>>>>> purged after 16 hours (30 values per hour * 16 hours = 480). It
>>>>>>>>>> would be great if you could update the documentation with this
>>>>>>>>>> info. The entire setup and configuration itself takes time to get a
>>>>>>>>>> hold of, and detailed documentation would be very helpful.
>>>>>>>>>>
>>>>>>>>>> Also, another quick question: right now, I believe the Redis TTL is
>>>>>>>>>> set to 2000 seconds. Does this mean that if I don't receive data
>>>>>>>>>> for a particular serviceitem (or service or host) for 2000 seconds,
>>>>>>>>>> the data related to it is lost?
>>>>>>>>>>
>>>>>>>>>> Also, any plans for bundling this with distributions such as
>>>>>>>>>> Debian?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Rahul.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Monday 08 September 2014 02:04 PM, Anders Håål wrote:
>>>>>>>>>>> Hi Rahul,
>>>>>>>>>>> Thanks for the question and feedback on the documentation. Great
>>>>>>>>>>> to hear that you think Bischeck is awesome. If you do not
>>>>>>>>>>> understand how it works by reading the documentation you are
>>>>>>>>>>> probably not alone, and we should consider it a documentation bug.
>>>>>>>>>>>
>>>>>>>>>>> In 1.0.0 we introduced the concepts that you are asking about, and
>>>>>>>>>>> they are really two different, independent features.
>>>>>>>>>>>
>>>>>>>>>>> Let's start with cache purging.
>>>>>>>>>>> Collected monitoring data, metrics, are kept in the cache (Redis
>>>>>>>>>>> from 1.0.0) as linked lists. There is one linked list per service
>>>>>>>>>>> definition, like host1-service1-serviceitem1. Prior to 1.0.0 all
>>>>>>>>>>> the linked lists had the same size, which was defined with the
>>>>>>>>>>> property lastStatusCacheSize. But in 1.0.0 we made that
>>>>>>>>>>> configurable so it can be defined per service definition.
>>>>>>>>>>> To enable individual cache configurations we added a section
>>>>>>>>>>> called <cache> in the serviceitem section of bischeck.xml. Like
>>>>>>>>>>> many other configuration options in 1.0.0, the cache section can
>>>>>>>>>>> have the specific values or point to a template that can be
>>>>>>>>>>> shared.
>>>>>>>>>>> To manage the size of the cache, or to be more specific the linked
>>>>>>>>>>> list size, we defined the <purge> section. The purge section can
>>>>>>>>>>> have two different configurations. The first defines the max size
>>>>>>>>>>> of the cache linked list.
>>>>>>>>>>> <cache>
>>>>>>>>>>>   <purge>
>>>>>>>>>>>    <maxcount>1000</maxcount>
>>>>>>>>>>>   </purge>
>>>>>>>>>>> </cache>
>>>>>>>>>>>
>>>>>>>>>>> The second option is to define the “time to live” for the metrics
>>>>>>>>>>> in the cache.
>>>>>>>>>>> <cache>
>>>>>>>>>>>   <purge>
>>>>>>>>>>>    <offset>10</offset>
>>>>>>>>>>>    <period>D</period>
>>>>>>>>>>>   </purge>
>>>>>>>>>>> </cache>
>>>>>>>>>>> In the above example we set the time to live to 10 days, so any
>>>>>>>>>>> metrics older than this period will be removed. The period can
>>>>>>>>>>> have the following values:
>>>>>>>>>>> H - hours
>>>>>>>>>>> D - days
>>>>>>>>>>> W - weeks
>>>>>>>>>>> Y - year
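>>>>>>>>>>>
>>>>>>>>>>> As another example with the same syntax, keeping only the last 48
>>>>>>>>>>> hours of metrics would be:
>>>>>>>>>>> <cache>
>>>>>>>>>>>   <purge>
>>>>>>>>>>>    <offset>48</offset>
>>>>>>>>>>>    <period>H</period>
>>>>>>>>>>>   </purge>
>>>>>>>>>>> </cache>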
>>>>>>>>>>>
>>>>>>>>>>> The two options are mutually exclusive. You have to choose one
>>>>>>>>>>> for each serviceitem or cache template.
>>>>>>>>>>>
>>>>>>>>>>> If no cache directive is defined for a serviceitem, the property
>>>>>>>>>>> lastStatusCacheSize will be used. Its default value is 500.
>>>>>>>>>>>
>>>>>>>>>>> Hopefully this explains the cache purging.
>>>>>>>>>>>
>>>>>>>>>>> The next question was related to aggregations, which have nothing
>>>>>>>>>>> to do with purging, but they are configured in the same <cache>
>>>>>>>>>>> section. The idea with aggregations was to create an automatic way
>>>>>>>>>>> to aggregate metrics on the level of an hour, day, week and month.
>>>>>>>>>>> The aggregation functions currently supported are average, max and
>>>>>>>>>>> min.
>>>>>>>>>>> Let's say you have a service definition of the format
>>>>>>>>>>> host1-service1-serviceitem1. When you enable an average (avg)
>>>>>>>>>>> aggregation you will automatically get the following new service
>>>>>>>>>>> definitions:
>>>>>>>>>>> host1-service1/H/avg-serviceitem1
>>>>>>>>>>> host1-service1/D/avg-serviceitem1
>>>>>>>>>>> host1-service1/W/avg-serviceitem1
>>>>>>>>>>> host1-service1/M/avg-serviceitem1
>>>>>>>>>>>
>>>>>>>>>>> The configuration you need to achieve the above average
>>>>>>>>>>> aggregations is:
>>>>>>>>>>> <cache>
>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>>   </aggregate>
>>>>>>>>>>> </cache>
>>>>>>>>>>>
>>>>>>>>>>> If you would like to combine it with the purging described above,
>>>>>>>>>>> your configuration would look like:
>>>>>>>>>>> <cache>
>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>
>>>>>>>>>>>   <purge>
>>>>>>>>>>>    <offset>10</offset>
>>>>>>>>>>>    <period>D</period>
>>>>>>>>>>>   </purge>
>>>>>>>>>>> </cache>
>>>>>>>>>>>
>>>>>>>>>>> The new aggregated service definitions,
>>>>>>>>>>> host1-service1/H/avg-serviceitem1, etc., will have their own cache
>>>>>>>>>>> entries and can be used in threshold configurations and virtual
>>>>>>>>>>> services like any other service definition. For example, in a
>>>>>>>>>>> threshold hours section we could define:
>>>>>>>>>>>
>>>>>>>>>>> <hours hoursID="2">
>>>>>>>>>>>
>>>>>>>>>>>   <hourinterval>
>>>>>>>>>>>     <from>09:00</from>
>>>>>>>>>>>     <to>12:00</to>
>>>>>>>>>>>     <threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
>>>>>>>>>>>   </hourinterval>
>>>>>>>>>>>   ...
>>>>>>>>>>>
>>>>>>>>>>> This would mean that we use the average value of
>>>>>>>>>>> host1-service1-serviceitem1 for the period of the last hour.
>>>>>>>>>>> Aggregations are calculated hourly, daily, weekly and monthly.
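>>>>>>>>>>>
>>>>>>>>>>> As another sketch along the same lines (the factor is just an
>>>>>>>>>>> example), a threshold based on the most recent daily average plus
>>>>>>>>>>> 20% would be:
>>>>>>>>>>>
>>>>>>>>>>> <threshold>host1-service1/D/avg-serviceitem1[0]*1.2</threshold>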
>>>>>>>>>>>
>>>>>>>>>>> By default weekend metrics are not included in the aggregation
>>>>>>>>>>> calculation. This can be enabled by setting
>>>>>>>>>>> <useweekend>true</useweekend>:
>>>>>>>>>>>
>>>>>>>>>>> <cache>
>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>   ….
>>>>>>>>>>> </cache>
>>>>>>>>>>>
>>>>>>>>>>> This will create aggregated service definitions with the
>>>>>>>>>>> following naming standard:
>>>>>>>>>>> host1-service1/H/avg/weekend-serviceitem1
>>>>>>>>>>> host1-service1/D/avg/weekend-serviceitem1
>>>>>>>>>>> host1-service1/W/avg/weekend-serviceitem1
>>>>>>>>>>> host1-service1/M/avg/weekend-serviceitem1
>>>>>>>>>>>
>>>>>>>>>>> You can also have multiple entries like:
>>>>>>>>>>> <cache>
>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>     <method>max</method>
>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>   ….
>>>>>>>>>>> </cache>
>>>>>>>>>>>
>>>>>>>>>>> So how long will the aggregated values be kept in the cache? By
>>>>>>>>>>> default we keep:
>>>>>>>>>>> Hourly aggregations for 25 hours
>>>>>>>>>>> Daily aggregations for 7 days
>>>>>>>>>>> Weekly aggregations for 5 weeks
>>>>>>>>>>> Monthly aggregations for 1 month
>>>>>>>>>>>
>>>>>>>>>>> These values can be overridden, but they cannot be lower than the
>>>>>>>>>>> defaults. Below you have an example where we keep the aggregations
>>>>>>>>>>> for 168 hours, 60 days and 53 weeks.
>>>>>>>>>>> <cache>
>>>>>>>>>>>   <aggregate>
>>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>>>>     <retention>
>>>>>>>>>>>       <period>H</period>
>>>>>>>>>>>       <offset>168</offset>
>>>>>>>>>>>     </retention>
>>>>>>>>>>>     <retention>
>>>>>>>>>>>      <period>D</period>
>>>>>>>>>>>       <offset>60</offset>
>>>>>>>>>>>     </retention>
>>>>>>>>>>>     <retention>
>>>>>>>>>>>       <period>W</period>
>>>>>>>>>>>       <offset>53</offset>
>>>>>>>>>>>     </retention>
>>>>>>>>>>>   </aggregate>
>>>>>>>>>>>   ….
>>>>>>>>>>> </cache>
>>>>>>>>>>>
>>>>>>>>>>> I hope this makes it a bit less confusing. What is clear to me is
>>>>>>>>>>> that we need to improve the documentation in this area.
>>>>>>>>>>>
>>>>>>>>>>> Looking forward to your feedback.
>>>>>>>>>>> Anders
>>>>>>>>>>>
>>>>>>>>>>> On 09/08/2014 06:02 AM, Rahul Amaram wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I am trying to set up the bischeck plugin for our organization.
>>>>>>>>>>>> I have configured most of it except for the cache retention
>>>>>>>>>>>> period. Here is what I want: I want to store every value which
>>>>>>>>>>>> has been generated during the past month. The reason is that my
>>>>>>>>>>>> threshold is currently calculated as the average of the metric
>>>>>>>>>>>> value during the past 4 weeks at the same time of the day.
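>>>>>>>>>>>>
>>>>>>>>>>>> Concretely, the kind of threshold I have in mind is roughly the
>>>>>>>>>>>> following sketch, where host1-service1-serviceitem1 is just a
>>>>>>>>>>>> placeholder for my real service definition:
>>>>>>>>>>>>
>>>>>>>>>>>> <threshold>avg(host1-service1-serviceitem1[-168H],host1-service1-serviceitem1[-336H],host1-service1-serviceitem1[-504H],host1-service1-serviceitem1[-672H])</threshold>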
>>>>>>>>>>>>
>>>>>>>>>>>> So, how do I define the cache template for this? If I don't
>>>>>>>>>>>> define any cache template, for how many days is the data kept?
>>>>>>>>>>>> Also, how does the aggregate function work, and what does the
>>>>>>>>>>>> purge Maxitems signify?
>>>>>>>>>>>>
>>>>>>>>>>>> I've gone through the documentation but it wasn't clear. Looking
>>>>>>>>>>>> forward to a response.
>>>>>>>>>>>>
>>>>>>>>>>>> Bischeck is one awesome plugin. Keep up the great work.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Rahul.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

