High CPU consumption by java and redis-server

Rahul Amaram rahul.amaram at vizury.com
Sat Sep 20 07:46:59 CEST 2014


Hi Anders,
I have raised a ticket - 
http://gforge.ingby.com/gf/project/bischeck/tracker/?action=TrackerItemEdit&tracker_item_id=260 
.

Really appreciate the prompt responses and the quick action to fix the 
issue :).

Regards,
Rahul.

On Friday 19 September 2014 07:38 PM, Anders Håål wrote:
> I have looked at the logic for query by time and the current 
> implementation works really bad for large list and just bad for 
> smaller lists. The searching is just a brute force sequential search 
> to find the closest. So the longer the list is and if the query is 
> time that is up in the end of the list performance will just suck. To 
> my only defense this is remaining code from when we did not have the 
> cache in redis, but Im not sure its an good excuse.
> Okay enough with remorse, what can be done? I have tested using a 
> binary search. On a list with 15000 items we get a hit after 13-14 
> iterations.  So for the same "heavy" test query I got the time down 
> from 6684 ms to 55 ms.
> The plan is to fix this and then release a 1.2.0 beta that hopefully 
> you and others can test. Hopefully I can have a beta in the beginning 
> of the week.
>
> @Rahul - Thanks for finding this issue. If you have the time please 
> make a bug report on this so we have the history. I also think I  
> misunderstood your question about "if data is missing in-between" and 
> you are of course correct that this will not work well using a query 
> by index.
>
> Anders
>
>
> On 09/18/2014 10:05 AM, anders.haal at ingby.com wrote:
>> Hi Rahul,
>> I really like numbers. Can you describe how much it changed to the 
>> better, bischeck utilization and redis utilization, etc? I will as 
>> soon as possible start testing to verify the problem you encounter so 
>> data to compare with would be great. If possible it would be great to 
>> have a look of your bischeck configuration files if you can share 
>> them. Of course you can send all that stuff directly to me if you do 
>> not want it to be shared on the list.
>>
>> When I take about index for time(stamp) it will use a different redis 
>> storage collection in. In the current version we use a linked list to 
>> store metrics data. That makes finding a metric item by its position, 
>> in bischeck called "query by index", easy and fairly quick. But when 
>> when we search for timestamp, which is data on the metrics stored in 
>> the the list we need to sort the the list to find the item closest to 
>> the time directive, like -24H- When list grow this search becomes 
>> heavier and heavier. So the idea we have is to store the metrics data 
>> in a sorted set with timestamp as the score, see redis.io, and inte 
>> linked list only store the timestamp/score. With the finding a time 
>> range will be fast since its by score, like [-12M:-30M]. Searching 
>> for a index position, like [3] will require two reads, find it in the 
>> linked list, get the score and read the full item from the sorted 
>> set. Finding range by index, [0:9] will finding score for index 0 and 
>> 9 and then get all item from the sorted set by the score.
>> When it comes to query by time [-12H] our definition is to find the 
>> metric with closest to 12 hours ago. As you say this is a bit more 
>> tricky. What do closes mean and how much should the span be? Like you 
>> say if there is no data for a period of hours, well then the are now 
>> data and maybe thats would be returned. The same goes for a time 
>> range, [-12H, -24H]. If there are no data for that period the current 
>> version will return null. So for the the query by time, [-12H], we 
>> should maybe use some sort of tolerance so the search is always a 
>> range of +- X minutes around the request value, and what is returned 
>> is the median of the result set.
>> Input and feedback is welcome, but this change, I think, will come in 
>> a major version of bischeck. The question is now in the short term we 
>> could improve the search speed for a large list, if that is the problem.
>>
>> When it comes to using multiple instance of redis Im not sure its a 
>> great idea. That would mean that bischeck must implement some sort of 
>> partioning/sharding. Redis has proven it self to be an excellent 
>> storage so I strongly believe we first need to implement our usage of 
>> it in the right way before we start thinking about this type of 
>> solutions.
>>
>>
>>
>> On 2014-09-18 05:47, Rahul Amaram wrote:
>>> Using indices has got down CPU usage of the java process 
>>> considerably. But won't indices create problem if data is missing 
>>> in-between? For ex. if data for 2-3 hours is missing, then won't 
>>> indices be offset by that time? Also, does bischeck support using 
>>> multiple redis-instances so that multi-core be exploited?
>>>
>>> Regards,
>>> Rahul.
>>>
>>> On Thursday 18 September 2014 01:58 AM, Anders Håål wrote:
>>>> Okay. 4-6 to seconds is absolutely to much and it may be related 
>>>> how the query by time is implement. The current search is pretty 
>>>> brute force finding the "right" time. Its not searching linearly 
>>>> but there are no "index" on time.  Searching by index is much 
>>>> quicker and query by time will be related to the size. With your 6 
>>>> values the search has to be done 6 times over a list that are 5000 
>>>> items. The future idea I mentioned will be a sort of index for the 
>>>> timestamp by using a sorted set.
>>>> What I would recommend you to do is to use index instead and see 
>>>> how that effect the performance. Since you use a interval of 120 
>>>> sec, the -24H will be the same as index 720, -96H will be the same 
>>>> as index 2880, etc.
>>>> I will try to get the time to set up an equivalent test 
>>>> environment. Keep me updated of your investigation
>>>> Anders
>>>>
>>>> On 09/17/2014 09:18 PM, Rahul Amaram wrote:
>>>>> /When it comes to your last finding I have no explanation. Just to 
>>>>> understand you compare using -24H with -10080M (-168H). Would it 
>>>>> not be better to compare -24H and -1440M. I have to get back to 
>>>>> you on this but I would need to get the result when running in 
>>>>> cacheCli since you get the time it takes, 
>>>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-4.4.//
>>>>> ///
>>>>> This was a typo. I was talking about -168H and -10080M. Also, I 
>>>>> used "bischeck cli.CacheCli" to check this. And I re-ran this now, 
>>>>> but not finding much difference between both of them (it takes 
>>>>> about 4-6 seconds to retrieve the value).
>>>>>
>>>>> Reg. other points, I have to get back to you. On a side note, I 
>>>>> have upgraded from redis-server 2.6 to 2.8, just to rule out any 
>>>>> version performance issues.
>>>>>
>>>>> Thanks,
>>>>> Rahul.
>>>>>
>>>>>
>>>>> On Thursday 18 September 2014 12:19 AM, Anders Håål wrote:
>>>>>> Hi Rahul,
>>>>>> Looking at your threshold this means that you will retrieve max 6 
>>>>>> values, which should not be that "hard" even if its a time based 
>>>>>> query - using index is faster and is something we will look into 
>>>>>> in the future.
>>>>>> Since you run the query every 120 sec it means that you currently 
>>>>>> have at lest 5040 items in the cache for this each service, which 
>>>>>> does not sound to bad. 10 services at least 50000 in total.
>>>>>> What I like you to check is the following:
>>>>>> - If you connect with some JMX client against bischeck you can 
>>>>>> see all the different timers 
>>>>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Chapter-5. 
>>>>>> The once that are related to threshold are inserting to start 
>>>>>> with but check all the different timers if some one have long 
>>>>>> execution time.
>>>>>> - Since the its the redis-server that are consume a high level of 
>>>>>> CPU its interesting to see the configuration for redis - like the 
>>>>>> amount of memory allocated. If redis need to swap its not good.
>>>>>> - Please check the redis log files.
>>>>>> - You can also connect to redis with redis-cli and run command 
>>>>>> "monitor" to get a real time listing on the commands executed 
>>>>>> against redis.
>>>>>> - Also check with top the percentage of %wa, waiting for io. How 
>>>>>> much memory do you have on the server? Only running bischeck and 
>>>>>> redis?
>>>>>> - How much cpu is bischeck consuming? Do you see any peaks?
>>>>>> - Also check the bischeck log to see any ERROR or WARN.
>>>>>> - And finally - has this been the behavior from the beginning or 
>>>>>> has it increased over time? What happen if you restart bischeck 
>>>>>> (not reload)?
>>>>>>
>>>>>> Try to collect some more info so we can try to determine where 
>>>>>> the issue is related.
>>>>>>
>>>>>> When it comes to your last finding I have no explanation. Just to 
>>>>>> understand you compare using -24H with -10080M (-168H). Would it 
>>>>>> not be better to compare -24H and -1440M. I have to get back to 
>>>>>> you on this but I would need to get the result when running in 
>>>>>> cacheCli since you get the time it takes, 
>>>>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-4.4.
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Anders
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 09/17/2014 07:13 PM, Rahul Amaram wrote:
>>>>>>> Hi,
>>>>>>> I am observing very high CPU consumption by the java process and 
>>>>>>> redis-server. redis-server being single threaded it self is 
>>>>>>> taking 100% CPU. I have about 10 hosts, with about 10 services 
>>>>>>> each (with one service item per service). The time interval for 
>>>>>>> generation of value is 120s. The threshold that I have defined is:
>>>>>>>
>>>>>>> avg($$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-24H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-96H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-168H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-336H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-504H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-672H]) 
>>>>>>>
>>>>>>>
>>>>>>> However, currently, not more than 3 values, are available.
>>>>>>>
>>>>>>> I am already running this on a c3.xlarge machine (4 cores) and 
>>>>>>> the load average is quite often > 4 resulting in delay of 
>>>>>>> generation of values. Any pointers in what could be causing the 
>>>>>>> high load would be much appreciated.
>>>>>>>
>>>>>>> On a slightly different note, while using cli.CacheCli, 
>>>>>>> retrieving the value of a service item one week back using hours 
>>>>>>> (-24H) is considerably faster than retrieving it using minutes 
>>>>>>> (-10080M). Again, why does bischeck behave this way?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Rahul.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> -- 
>>>>
>>>> Ingby<http://www.ingby.com>
>>>>
>>>> IngbyForge<http://gforge.ingby.com>
>>>>
>>>> bischeck - dynamic and adaptive thresholds for Nagios<http://www.bischeck.org>
>>>>
>>>> anders.haal at ingby.com<mailto:anders.haal at ingby.com>
>>>>
>>>> Mjukvara genom ingenjörsmässig kreativitet och kompetens
>>>>
>>>> Ingenjörsbyn
>>>> Box 531
>>>> 101 30 Stockholm
>>>> Sweden
>>>> www.ingby.com  <http://www.ingby.com/>
>>>> Mobil: +46 70 575 35 46
>>>> Tele: +46 75 75 75 090
>>>> Fax:  +46 75 75 75 091
>>>>
>>>
>>>
>>
>> -- 
>>
>>
>> Ingby<http://www.ingby.com>
>>
>> IngbyForge<http://gforge.ingby.com>
>>
>> bischeck - dynamic and adaptive monitoring for Nagios<http://www.bischeck.org>
>>
>> anders.haal at ingby.com<mailto:anders.haal at ingby.com>
>>
>> Mjukvara genom ingenjörsmässig kreativitet och kompetens
>>
>> Ingenjörsbyn
>> Box 531
>> 101 30 Stockholm
>> Sweden
>> www.ingby.com  <http://www.ingby.com/>
>> Mobil: +46 70 575 35 46
>> Tele: +46 75 75 75 090
>> Fax:  +46 75 75 75 091
>
>
> -- 
>
> Ingby<http://www.ingby.com>
>
> IngbyForge<http://gforge.ingby.com>
>
> bischeck - dynamic and adaptive thresholds for Nagios<http://www.bischeck.org>
>
> anders.haal at ingby.com<mailto:anders.haal at ingby.com>
>
> Mjukvara genom ingenjörsmässig kreativitet och kompetens
>
> Ingenjörsbyn
> Box 531
> 101 30 Stockholm
> Sweden
> www.ingby.com  <http://www.ingby.com/>
> Mobil: +46 70 575 35 46
> Tele: +46 75 75 75 090
> Fax:  +46 75 75 75 091
>

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://www.monitoring-lists.org/archive/bischeck-users/attachments/20140920/494c0071/attachment-0001.html>


More information about the Bischeck-users mailing list