Specifying the retention period

Anders Håål anders.haal at ingby.com
Thu Sep 11 07:00:25 CEST 2014


Hi Amaram,
I think you just need to remove the minus sign when using the 
aggregated service definitions. A minus sign together with a time 
indicator means back in time, while a plain integer without a minus 
sign or time indicator is an index into the cache. Check out 
http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_configuration_guide.html#toc-Chapter-4. 
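
For example, an index-based version of your threshold could look like 
this (a sketch only, not tested; assuming one hourly aggregate per hour 
and zero-based indexing from the most recent value, so index 23 is 
about one day back, 167 about one week back and 335 about two weeks 
back):

<threshold>avg($$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[23],$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[167],$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[335])</threshold>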

You can also use redis-cli to explore the data in the cache. The key in 
Redis is the same as the service definition name.
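
For example (hypothetical key/service definition names):

redis-cli KEYS '*/H/avg-*'
redis-cli LRANGE 'host1-service1/H/avg-serviceitem1' 0 9

The first command lists the hourly aggregated keys, and the second 
shows the ten entries at the head of that list.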
Anders

On 09/11/2014 06:38 AM, Rahul Amaram wrote:
> Ok. I am facing another issue. I have been running bischeck with the 
> aggregate function for more than a day. I am using the below threshold 
> function.
>
> <threshold>avg($$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-24],$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-168],$$HOSTNAME$$-$$SERVICENAME$$/H/avg-$$SERVICEITEMNAME$$[-336])</threshold> 
>
>
> and it doesn't seem to work. I expect the first aggregate value to be
> available by now.
>
> Instead, if I use the below threshold function (I know this is not
> related to aggregates)
>
> avg($$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-24H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-168H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-336H]) 
>
>
> the threshold is calculated fine; it is just the first value, as the
> remaining two values are not in the cache.
>
> How can I debug why aggregate is not working?
>
> Thanks,
> Rahul.
>
> On Wednesday 10 September 2014 04:53 PM, Anders Håål wrote:
>> Thanks - got the ticket.
>> I will update progress on the bug ticket, but it's good that the
>> workaround works.
>> Anders
>>
>> On 09/10/2014 01:20 PM, Rahul Amaram wrote:
>>> That indeed seems to be the problem. Using count rather than period
>>> seems to address the issue. Raised a ticket -
>>> http://gforge.ingby.com/gf/project/bischeck/tracker/?action=TrackerItemEdit&tracker_item_id=259
>>>
>>> Thanks,
>>> Rahul.
>>>
>>> On Wednesday 10 September 2014 04:02 PM, Anders Håål wrote:
>>>> This looks like a bug. Could you please report it on
>>>> http://gforge.ingby.com/gf/project/bischeck/tracker/ in the Bugs
>>>> tracker. You need an account, but it's just a sign-up and you get an
>>>> email confirmation.
>>>> Can you try to use maxcount for purging instead as a workaround? Just
>>>> calculate your maxcount based on the scheduling interval you use.
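>>>> For example (a sketch, assuming a 5 minute scheduling interval and a
>>>> goal of keeping 30 days of data): 12 values/hour * 24 hours * 30 days
>>>> = 8640, so:
>>>>
>>>> <cache>
>>>>   <purge>
>>>>     <maxcount>8640</maxcount>
>>>>   </purge>
>>>> </cache>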
>>>> Anders
>>>>
>>>> On 09/10/2014 12:17 PM, Rahul Amaram wrote:
>>>>> Following up on the earlier topic, I am seeing the below errors
>>>>> related to cache purge. Any idea on what might be causing this? I
>>>>> don't see any other errors in the log related to metrics.
>>>>>
>>>>> 2014-09-10 12:12:00.001 ; INFO ; DefaultQuartzScheduler_Worker-5 ;
>>>>> com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge
>>>>> purging 180
>>>>> 2014-09-10 12:12:00.003 ; INFO ; DefaultQuartzScheduler_Worker-5 ;
>>>>> com.ingby.socbox.bischeck.configuration.CachePurgeJob ; CachePurge
>>>>> executed in 1 ms
>>>>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ;
>>>>> org.quartz.core.JobRunShell ; Job DailyMaintenance.CachePurge threw an
>>>>> unhandled Exception: java.lang.NullPointerException: null
>>>>>         at com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
>>>>>         at com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)
>>>>> 2014-09-10 12:12:00.003 ; ERROR ; DefaultQuartzScheduler_Worker-5 ;
>>>>> org.quartz.core.ErrorLogger ; Job (DailyMaintenance.CachePurge threw an
>>>>> exception.
>>>>> org.quartz.SchedulerException: Job threw an unhandled exception.
>>>>>         at org.quartz.core.JobRunShell.run(JobRunShell.java:224)
>>>>>         at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557)
>>>>> Caused by: java.lang.NullPointerException: null
>>>>>         at com.ingby.socbox.bischeck.cache.provider.redis.LastStatusCache.trim(LastStatusCache.java:1250)
>>>>>         at com.ingby.socbox.bischeck.configuration.CachePurgeJob.execute(CachePurgeJob.java:140)
>>>>>
>>>>> Here is my cache configuration:
>>>>>
>>>>>      <cache>
>>>>>        <aggregate>
>>>>>          <method>avg</method>
>>>>>          <useweekend>true</useweekend>
>>>>>          <retention>
>>>>>            <period>H</period>
>>>>>            <offset>720</offset>
>>>>>          </retention>
>>>>>          <retention>
>>>>>            <period>D</period>
>>>>>            <offset>30</offset>
>>>>>          </retention>
>>>>>        </aggregate>
>>>>>
>>>>>        <purge>
>>>>>          <offset>30</offset>
>>>>>          <period>D</period>
>>>>>        </purge>
>>>>>      </cache>
>>>>>
>>>>> Regards,
>>>>> Rahul.
>>>>> On Monday 08 September 2014 08:39 PM, Anders Håål wrote:
>>>>>> Great if you can make a Debian package, and I understand that you
>>>>>> cannot commit to a timeline. The best thing would be to integrate it
>>>>>> into our build process, where we use ant.
>>>>>>
>>>>>> If the purging is based on time, then it could happen that data is
>>>>>> removed from the cache, since the logic is based on time relative to
>>>>>> now. To avoid it you should increase the purge time before you start
>>>>>> bischeck. And just a comment on your last sentence: the Redis TTL is
>>>>>> never used :)
>>>>>> Anders
>>>>>>
>>>>>> On 09/08/2014 02:09 PM, Rahul Amaram wrote:
>>>>>>> I would be more than happy to give you guys a testimonial. However,
>>>>>>> we have just taken this live and would like to see its performance
>>>>>>> before I give one.
>>>>>>>
>>>>>>> Also, if time permits, I'll try to bundle this for Debian (I'm a
>>>>>>> Debian maintainer). I can't commit to a timeline right away though :).
>>>>>>>
>>>>>>> Also, just to make things explicitly clear: I understand that the
>>>>>>> below service item TTL has nothing to do with the Redis TTL. But if
>>>>>>> I stop my bischeck server for a day or two, would any of my metrics
>>>>>>> get lost? Or would I have to increase the Redis TTL for this?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Rahul.
>>>>>>>
>>>>>>> On Monday 08 September 2014 04:09 PM, Anders Håål wrote:
>>>>>>>> Glad that it clarified how to configure the cache section. I will
>>>>>>>> make a blog post on this in the meantime, until we have updated
>>>>>>>> documentation. I agree with you that the structure of the
>>>>>>>> configuration is a bit "heavy", so ideas and input are appreciated.
>>>>>>>>
>>>>>>>> Regarding the Redis TTL, this is a Redis feature we do not use. The
>>>>>>>> TTL mentioned in my mail is managed by bischeck. A Redis TTL does
>>>>>>>> not work on individual nodes of a Redis linked list.
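>>>>>>>>
>>>>>>>> For example (hypothetical key name):
>>>>>>>>
>>>>>>>> redis-cli RPUSH mymetrics 1.0 2.0 3.0
>>>>>>>> redis-cli EXPIRE mymetrics 60
>>>>>>>>
>>>>>>>> The EXPIRE applies to the whole list key, never to an individual
>>>>>>>> node in the list.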
>>>>>>>>
>>>>>>>> Currently the bischeck installer should work for Ubuntu,
>>>>>>>> RedHat/CentOS and Debian. There are currently no plans to make
>>>>>>>> distribution packages like rpm or deb. I know op5 (www.op5.com),
>>>>>>>> which bundles Bischeck, makes a Bischeck rpm. It would be super if
>>>>>>>> there is anyone who would like to do this for the project.
>>>>>>>> When it comes to packaging we have done a bit of work to create
>>>>>>>> Docker containers, but it's still experimental.
>>>>>>>>
>>>>>>>> I also encourage you, if you think bischeck supports your
>>>>>>>> monitoring effort, to write a small testimonial that we can put on
>>>>>>>> the site.
>>>>>>>> Regards
>>>>>>>> Anders
>>>>>>>>
>>>>>>>> On 09/08/2014 11:30 AM, Rahul Amaram wrote:
>>>>>>>>> Thanks Anders. This explains precisely why my data was getting
>>>>>>>>> purged after 16 hours (30 values per hour * 16 hours = 480, which
>>>>>>>>> is roughly the default cache size of 500). It would be great if
>>>>>>>>> you could update the documentation with this info. The entire
>>>>>>>>> setup and configuration takes time to get a hold of, and detailed
>>>>>>>>> documentation would be very helpful.
>>>>>>>>>
>>>>>>>>> Also, another quick question. Right now, I believe the Redis TTL
>>>>>>>>> is set to 2000 seconds. Does this mean that if I don't receive
>>>>>>>>> data for a particular serviceitem (or service or host) for 2000
>>>>>>>>> seconds, the data related to it is lost?
>>>>>>>>>
>>>>>>>>> Also, any plans for bundling this with distributions such as
>>>>>>>>> Debian?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Rahul.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Monday 08 September 2014 02:04 PM, Anders Håål wrote:
>>>>>>>>>> Hi Rahul,
>>>>>>>>>> Thanks for the question and feedback on the documentation. Great
>>>>>>>>>> to hear that you think Bischeck is awesome. If you do not
>>>>>>>>>> understand how it works by reading the documentation you are
>>>>>>>>>> probably not alone, and we should consider it a documentation bug.
>>>>>>>>>>
>>>>>>>>>> In 1.0.0 we introduced the concepts that you are asking about,
>>>>>>>>>> and they are really two different, independent features.
>>>>>>>>>>
>>>>>>>>>> Let's start with cache purging.
>>>>>>>>>> Collected monitoring data, metrics, are kept in the cache (Redis
>>>>>>>>>> from 1.0.0) as linked lists. There is one linked list per service
>>>>>>>>>> definition, like host1-service1-serviceitem1. Prior to 1.0.0 all
>>>>>>>>>> the linked lists had the same size, defined with the property
>>>>>>>>>> lastStatusCacheSize. But in 1.0.0 we made that configurable so it
>>>>>>>>>> could be defined per service definition.
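>>>>>>>>>>
>>>>>>>>>> For example, you can inspect a cached list directly with
>>>>>>>>>> redis-cli (hypothetical service definition name):
>>>>>>>>>>
>>>>>>>>>> redis-cli LLEN 'host1-service1-serviceitem1'
>>>>>>>>>> redis-cli LRANGE 'host1-service1-serviceitem1' 0 4
>>>>>>>>>>
>>>>>>>>>> LLEN shows the number of metrics currently cached, and LRANGE
>>>>>>>>>> shows the five entries at the head of the list.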
>>>>>>>>>> To enable individual cache configurations we added a section
>>>>>>>>>> called <cache> in the serviceitem section of bischeck.xml. Like
>>>>>>>>>> many other configuration options in 1.0.0, the cache section can
>>>>>>>>>> either hold the specific values or point to a template that can
>>>>>>>>>> be shared.
>>>>>>>>>> To manage the size of the cache, or to be more specific the
>>>>>>>>>> linked list size, we defined the <purge> section. The purge
>>>>>>>>>> section can have two different configurations. The first defines
>>>>>>>>>> the max size of the cache linked list:
>>>>>>>>>> <cache>
>>>>>>>>>>   <purge>
>>>>>>>>>>    <maxcount>1000</maxcount>
>>>>>>>>>>   </purge>
>>>>>>>>>> </cache>
>>>>>>>>>>
>>>>>>>>>> The second option is to define the “time to live” for the
>>>>>>>>>> metrics in the cache:
>>>>>>>>>> <cache>
>>>>>>>>>>   <purge>
>>>>>>>>>>    <offset>10</offset>
>>>>>>>>>>    <period>D</period>
>>>>>>>>>>   </purge>
>>>>>>>>>> </cache>
>>>>>>>>>> In the above example we set the time to live to 10 days, so any
>>>>>>>>>> metrics older than this period will be removed. The period can
>>>>>>>>>> have the following values:
>>>>>>>>>> the following values:
>>>>>>>>>> H - hours
>>>>>>>>>> D - days
>>>>>>>>>> W - weeks
>>>>>>>>>> Y - year
>>>>>>>>>>
>>>>>>>>>> The two options are mutually exclusive. You have to choose one
>>>>>>>>>> for each serviceitem or cache template.
>>>>>>>>>>
>>>>>>>>>> If no cache directive is defined for a serviceitem, the property
>>>>>>>>>> lastStatusCacheSize will be used. Its default value is 500.
>>>>>>>>>>
>>>>>>>>>> Hopefully this explains the cache purging.
>>>>>>>>>>
>>>>>>>>>> The next question was related to aggregations, which have
>>>>>>>>>> nothing to do with purging but are configured in the same <cache>
>>>>>>>>>> section. The idea with aggregations was to create an automatic
>>>>>>>>>> way to aggregate metrics at the level of an hour, day, week and
>>>>>>>>>> month. The aggregation functions currently supported are average,
>>>>>>>>>> max and min.
>>>>>>>>>> Let's say you have a service definition of the format
>>>>>>>>>> host1-service1-serviceitem1. When you enable an average (avg)
>>>>>>>>>> aggregation you will automatically get the following new service
>>>>>>>>>> definitions:
>>>>>>>>>> host1-service1/H/avg-serviceitem1
>>>>>>>>>> host1-service1/D/avg-serviceitem1
>>>>>>>>>> host1-service1/W/avg-serviceitem1
>>>>>>>>>> host1-service1/M/avg-serviceitem1
>>>>>>>>>>
>>>>>>>>>> The configuration you need to achieve the above average
>>>>>>>>>> aggregations is:
>>>>>>>>>> <cache>
>>>>>>>>>>   <aggregate>
>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>   </aggregate>
>>>>>>>>>> </cache>
>>>>>>>>>>
>>>>>>>>>> If you would like to combine it with the purging described
>>>>>>>>>> above, your configuration would look like:
>>>>>>>>>> <cache>
>>>>>>>>>>   <aggregate>
>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>   </aggregate>
>>>>>>>>>>
>>>>>>>>>>   <purge>
>>>>>>>>>>    <offset>10</offset>
>>>>>>>>>>    <period>D</period>
>>>>>>>>>>   </purge>
>>>>>>>>>> </cache>
>>>>>>>>>>
>>>>>>>>>> The new aggregated service definitions,
>>>>>>>>>> host1-service1/H/avg-serviceitem1, etc., will have their own
>>>>>>>>>> cache entries and can be used in threshold configurations and
>>>>>>>>>> virtual services like any other service definition. For example,
>>>>>>>>>> in a threshold hours section we could define:
>>>>>>>>>>
>>>>>>>>>> <hours hoursID="2">
>>>>>>>>>>   <hourinterval>
>>>>>>>>>>     <from>09:00</from>
>>>>>>>>>>     <to>12:00</to>
>>>>>>>>>>     <threshold>host1-service1/H/avg-serviceitem1[0]*0.8</threshold>
>>>>>>>>>>   </hourinterval>
>>>>>>>>>>   ...
>>>>>>>>>>
>>>>>>>>>> This would mean that we use the average value of
>>>>>>>>>> host1-service1-serviceitem1 for the period of the last hour.
>>>>>>>>>> Aggregations are calculated hourly, daily, weekly and monthly.
>>>>>>>>>>
>>>>>>>>>> By default weekend metrics are not included in the aggregation
>>>>>>>>>> calculation. This can be enabled by setting
>>>>>>>>>> <useweekend>true</useweekend>:
>>>>>>>>>>
>>>>>>>>>> <cache>
>>>>>>>>>>   <aggregate>
>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>>>   </aggregate>
>>>>>>>>>>   ….
>>>>>>>>>> </cache>
>>>>>>>>>>
>>>>>>>>>> This will create aggregated service definitions with the
>>>>>>>>>> following naming standard:
>>>>>>>>>> host1-service1/H/avg/weekend-serviceitem1
>>>>>>>>>> host1-service1/D/avg/weekend-serviceitem1
>>>>>>>>>> host1-service1/W/avg/weekend-serviceitem1
>>>>>>>>>> host1-service1/M/avg/weekend-serviceitem1
>>>>>>>>>>
>>>>>>>>>> You can also have multiple entries like:
>>>>>>>>>> <cache>
>>>>>>>>>>   <aggregate>
>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>>>   </aggregate>
>>>>>>>>>>   <aggregate>
>>>>>>>>>>     <method>max</method>
>>>>>>>>>>   </aggregate>
>>>>>>>>>>   ….
>>>>>>>>>> </cache>
>>>>>>>>>>
>>>>>>>>>> So how long will the aggregated values be kept in the cache? By
>>>>>>>>>> default we save:
>>>>>>>>>> hourly aggregations for 25 hours
>>>>>>>>>> daily aggregations for 7 days
>>>>>>>>>> weekly aggregations for 5 weeks
>>>>>>>>>> monthly aggregations for 1 month
>>>>>>>>>>
>>>>>>>>>> These values can be overridden, but they cannot be lower than
>>>>>>>>>> the defaults. Below is an example where we save the aggregations
>>>>>>>>>> for 168 hours, 60 days and 53 weeks:
>>>>>>>>>> <cache>
>>>>>>>>>>   <aggregate>
>>>>>>>>>>     <method>avg</method>
>>>>>>>>>>     <useweekend>true</useweekend>
>>>>>>>>>>     <retention>
>>>>>>>>>>       <period>H</period>
>>>>>>>>>>       <offset>168</offset>
>>>>>>>>>>     </retention>
>>>>>>>>>>     <retention>
>>>>>>>>>>       <period>D</period>
>>>>>>>>>>       <offset>60</offset>
>>>>>>>>>>     </retention>
>>>>>>>>>>     <retention>
>>>>>>>>>>       <period>W</period>
>>>>>>>>>>       <offset>53</offset>
>>>>>>>>>>     </retention>
>>>>>>>>>>   </aggregate>
>>>>>>>>>>   ….
>>>>>>>>>> </cache>
>>>>>>>>>>
>>>>>>>>>> I hope this makes it a bit less confusing. What is clear to me
>>>>>>>>>> is that we need to improve the documentation in this area.
>>>>>>>>>>
>>>>>>>>>> Looking forward to your feedback.
>>>>>>>>>> Anders
>>>>>>>>>>
>>>>>>>>>> On 09/08/2014 06:02 AM, Rahul Amaram wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>> I am trying to set up the bischeck plugin for our organization.
>>>>>>>>>>> I have configured most of it except for the cache retention
>>>>>>>>>>> period. Here is what I want: to store every value that has been
>>>>>>>>>>> generated during the past month. The reason is that my threshold
>>>>>>>>>>> is currently calculated as the average of the metric value
>>>>>>>>>>> during the past 4 weeks at the same time of day.
>>>>>>>>>>>
>>>>>>>>>>> So, how do I define the cache template for this? If I don't
>>>>>>>>>>> define any cache template, for how many days is the data kept?
>>>>>>>>>>> Also, how does the aggregate function work, and what does the
>>>>>>>>>>> purge maxcount signify?
>>>>>>>>>>>
>>>>>>>>>>> I've gone through the documentation but it wasn't clear.
>>>>>>>>>>> Looking forward to a response.
>>>>>>>>>>>
>>>>>>>>>>> Bischeck is one awesome plugin. Keep up the great work.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Rahul.


-- 

Ingby <http://www.ingby.com>

IngbyForge <http://gforge.ingby.com>

bischeck - dynamic and adaptive thresholds for Nagios <http://www.bischeck.org>

anders.haal at ingby.com

Software through engineering creativity and competence

Ingenjörsbyn
Box 531
101 30 Stockholm
Sweden
www.ingby.com
Mobile: +46 70 575 35 46
Phone: +46 75 75 75 090
Fax:  +46 75 75 75 091


