High CPU consumption by java and redis-server

Anders Håål anders.haal at ingby.com
Fri Sep 19 16:08:25 CEST 2014


I have looked at the logic for query by time and the current 
implementation works really badly for large lists and just badly for 
smaller ones. The search is a brute-force sequential scan to find the 
closest match, so the longer the list is, and the further towards the 
end of the list the queried time falls, the worse the performance gets. 
In my defense, this is code left over from before we had the cache in 
Redis, but I'm not sure that is a good excuse.
Okay, enough remorse. What can be done? I have tested using a binary 
search. On a list with 15000 items we get a hit after 13-14 iterations, 
so for the same "heavy" test query I got the time down from 6684 ms to 
55 ms.
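
For reference, the search I have in mind is roughly the classic 
"closest match" binary search (a minimal sketch, not the actual 
bischeck code; it assumes the timestamps are available as an array 
sorted in ascending order):

static int closestIndex(long[] timestamps, long target) {
    // Binary search for the first entry >= target, then compare it
    // with its predecessor to pick the closest one.
    if (timestamps.length == 0) {
        return -1;
    }
    int low = 0;
    int high = timestamps.length - 1;
    while (low < high) {
        int mid = (low + high) >>> 1;
        if (timestamps[mid] < target) {
            low = mid + 1;
        } else {
            high = mid;
        }
    }
    if (low > 0 && target - timestamps[low - 1] < timestamps[low] - target) {
        return low - 1;
    }
    return low;
}

With 15000 items this is the 13-14 comparisons mentioned above.
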
The plan is to fix this and then release a 1.2.0 beta that hopefully you 
and others can test. I hope to have a beta ready early in the week.

@Rahul - Thanks for finding this issue. If you have the time, please 
file a bug report on this so we have the history. I also think I 
misunderstood your question about "if data is missing in-between"; you 
are of course correct that this will not work well with a query by 
index.

Anders


On 09/18/2014 10:05 AM, anders.haal at ingby.com wrote:
> Hi Rahul,
> I really like numbers. Can you describe how much things improved: 
> bischeck utilization, redis utilization, etc.? I will start testing as 
> soon as possible to reproduce the problem you encountered, so data to 
> compare against would be great. If possible it would also be great to 
> have a look at your bischeck configuration files, if you can share 
> them. Of course you can send all that stuff directly to me if you do 
> not want it shared on the list.
>
> When I talk about an index for the time(stamp), it means using a 
> different Redis storage collection. In the current version we use a 
> linked list to store the metrics data. That makes finding a metric 
> item by its position, in bischeck called "query by index", easy and 
> fairly quick. But when we search by timestamp, which is data stored on 
> the metric items in the list, we need to search through the list to 
> find the item closest to the time directive, like -24H. When the list 
> grows, this search becomes heavier and heavier. So the idea we have is 
> to store the metrics data in a sorted set with the timestamp as the 
> score, see redis.io, and in the linked list only store the 
> timestamp/score. With that, finding a time range will be fast since it 
> is done by score, like [-12M:-30M]. Searching for an index position, 
> like [3], will require two reads: find it in the linked list, get the 
> score, and read the full item from the sorted set. Finding a range by 
> index, [0:9], will find the scores for index 0 and 9 and then get all 
> items from the sorted set by score.
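>
> Roughly, the layout could look like the sketch below (just an 
> illustration using the Jedis client; the key names, the service id and 
> the item format are made-up examples, not the actual bischeck schema):
>
> import redis.clients.jedis.Jedis;
>
> public class SortedSetSketch {
>     public static void main(String[] args) {
>         Jedis jedis = new Jedis("localhost");
>         String service = "host0-service0-item0";   // example key prefix
>
>         // Write: the full metric item goes into a sorted set scored by
>         // its timestamp; the linked list keeps only the timestamp/score.
>         long ts = System.currentTimeMillis();
>         jedis.zadd(service + "/set", ts, ts + ";1.23;OK");
>         jedis.rpush(service + "/idx", Long.toString(ts));
>
>         // Query by time range, e.g. [-30M:-12M]: one read by score.
>         long now = System.currentTimeMillis();
>         for (String item : jedis.zrangeByScore(service + "/set",
>                 now - 30 * 60 * 1000L, now - 12 * 60 * 1000L)) {
>             System.out.println(item);
>         }
>
>         // Query by index, e.g. [3]: two reads - the score from the
>         // linked list, then the full item from the sorted set.
>         String score = jedis.lindex(service + "/idx", 3);
>         if (score != null) {
>             double s = Double.parseDouble(score);
>             for (String item : jedis.zrangeByScore(service + "/set", s, s)) {
>                 System.out.println(item);
>             }
>         }
>         jedis.close();
>     }
> }
>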
> When it comes to query by time, [-12H], our definition is to find the 
> metric closest to 12 hours ago. As you say, this is a bit more tricky. 
> What does closest mean, and how large should the span be? Like you 
> say, if there is no data for a period of hours, well, then there is no 
> data, and maybe that is what should be returned. The same goes for a 
> time range, [-12H, -24H]. If there is no data for that period the 
> current version will return null. So for the query by time, [-12H], we 
> should maybe use some sort of tolerance so the search is always a 
> range of +/- X minutes around the requested value, and what is 
> returned is the median of the result set.
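>
> For the [-12H] case, the tolerance idea would then be a single range 
> read over the same sorted set (continuing the sketch above; "target" 
> and "toleranceMs" are hypothetical names, and how to pick the median 
> from the returned window is exactly the open design question):
>
> // target = requested timestamp in ms, toleranceMs = +/- window in ms
> // (both hypothetical). Fetch everything within the window; the median
> // (or the single closest item) is then picked from this list.
> java.util.List<String> window = new java.util.ArrayList<>(
>         jedis.zrangeByScore(service + "/set",
>                 target - toleranceMs, target + toleranceMs));
>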
> Input and feedback are welcome, but this change, I think, will come in 
> a major version of bischeck. The question now is whether, in the short 
> term, we could improve the search speed for a large list, if that is 
> the problem.
>
> When it comes to using multiple instances of Redis, I'm not sure it is 
> a great idea. That would mean that bischeck must implement some sort 
> of partitioning/sharding. Redis has proven itself to be an excellent 
> store, so I strongly believe we first need to implement our usage of 
> it in the right way before we start thinking about that type of 
> solution.
>
>
>
> On 2014-09-18 05:47, Rahul Amaram wrote:
>> Using indices has brought down the CPU usage of the java process 
>> considerably. But won't indices create problems if data is missing 
>> in between? For example, if data for 2-3 hours is missing, then won't 
>> the indices be offset by that time? Also, does bischeck support using 
>> multiple redis instances so that multiple cores can be exploited?
>>
>> Regards,
>> Rahul.
>>
>> On Thursday 18 September 2014 01:58 AM, Anders Håål wrote:
>>> Okay. 4-6 seconds is absolutely too much, and it may be related to 
>>> how the query by time is implemented. The current search is pretty 
>>> brute force in finding the "right" time. It is not searching 
>>> linearly, but there is no "index" on time. Searching by index is 
>>> much quicker, while query by time will be related to the size of 
>>> the list. With your 6 values the search has to be done 6 times over 
>>> a list of 5000 items. The future idea I mentioned is a sort of 
>>> index for the timestamp, using a sorted set.
>>> What I would recommend you to do is to use indices instead and see 
>>> how that affects performance. Since you use an interval of 120 sec, 
>>> -24H will be the same as index 720, -96H will be the same as index 
>>> 2880, etc.
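>>> Just to spell out the arithmetic: the index is simply the time 
>>> offset divided by the collection interval (variable names below are 
>>> just for illustration).
>>>
>>> int intervalSec = 120;                    // your collection interval
>>> int index24H = 24 * 3600 / intervalSec;   // 86400 / 120 = 720
>>> int index96H = 96 * 3600 / intervalSec;   // 345600 / 120 = 2880
>>>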
>>> I will try to find the time to set up an equivalent test 
>>> environment. Keep me updated on your investigation.
>>> Anders
>>>
>>> On 09/17/2014 09:18 PM, Rahul Amaram wrote:
>>>> When it comes to your last finding I have no explanation. Just to 
>>>> understand: you compare using -24H with -10080M (-168H). Would it 
>>>> not be better to compare -24H and -1440M? I have to get back to you 
>>>> on this, but I would need to see the result when running in 
>>>> CacheCli since it gives you the time it takes, 
>>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-4.4.
>>>> This was a typo. I was talking about -168H and -10080M. Also, I 
>>>> used "bischeck cli.CacheCli" to check this. I re-ran it now, but I 
>>>> am not finding much difference between the two (it takes about 4-6 
>>>> seconds to retrieve the value).
>>>>
>>>> Regarding the other points, I have to get back to you. On a side 
>>>> note, I have upgraded from redis-server 2.6 to 2.8, just to rule 
>>>> out any version-related performance issues.
>>>>
>>>> Thanks,
>>>> Rahul.
>>>>
>>>>
>>>> On Thursday 18 September 2014 12:19 AM, Anders Håål wrote:
>>>>> Hi Rahul,
>>>>> Looking at your threshold, this means that you will retrieve at 
>>>>> most 6 values, which should not be that "hard" even if it is a 
>>>>> time-based query. Using an index is faster and is something we 
>>>>> will look into in the future.
>>>>> Since you run the query every 120 sec it means that you currently 
>>>>> have at least 5040 items in the cache for each of these services, 
>>>>> which does not sound too bad. With 10 services that is at least 
>>>>> 50000 items in total.
>>>>> What I would like you to check is the following:
>>>>> - If you connect with a JMX client against bischeck you can see 
>>>>> all the different timers, 
>>>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Chapter-5. 
>>>>> The ones related to thresholds are interesting to start with, but 
>>>>> check all the different timers to see if any of them have a long 
>>>>> execution time.
>>>>> - Since it is the redis-server that is consuming a high level of 
>>>>> CPU, it is interesting to see the configuration for Redis, like 
>>>>> the amount of memory allocated. If Redis needs to swap, that is 
>>>>> not good.
>>>>> - Please check the Redis log files.
>>>>> - You can also connect to Redis with redis-cli and run the command 
>>>>> "monitor" to get a real-time listing of the commands executed 
>>>>> against Redis.
>>>>> - Also check with top the percentage of %wa, waiting for I/O. How 
>>>>> much memory do you have on the server? Is it only running bischeck 
>>>>> and Redis?
>>>>> - How much CPU is bischeck consuming? Do you see any peaks?
>>>>> - Also check the bischeck log for any ERROR or WARN entries.
>>>>> - And finally, has this been the behavior from the beginning or 
>>>>> has it increased over time? What happens if you restart bischeck 
>>>>> (not reload)?
>>>>>
>>>>> Try to collect some more info so we can try to determine where 
>>>>> the issue lies.
>>>>>
>>>>> When it comes to your last finding I have no explanation. Just to 
>>>>> understand: you compare using -24H with -10080M (-168H). Would it 
>>>>> not be better to compare -24H and -1440M? I have to get back to 
>>>>> you on this, but I would need to see the result when running in 
>>>>> CacheCli since it gives you the time it takes, 
>>>>> http://www.bischeck.org/wp-content/uploads/2014/06/Bischeck_installation_and_administration_guide.html#toc-Section-4.4.
>>>>>
>>>>>
>>>>> Regards
>>>>> Anders
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 09/17/2014 07:13 PM, Rahul Amaram wrote:
>>>>>> Hi,
>>>>>> I am observing very high CPU consumption by the java process and 
>>>>>> redis-server. redis-server, being single-threaded, is itself 
>>>>>> taking 100% CPU. I have about 10 hosts, with about 10 services 
>>>>>> each (with one service item per service). The time interval for 
>>>>>> value generation is 120s. The threshold that I have defined is:
>>>>>>
>>>>>> avg($$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-24H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-96H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-168H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-336H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-504H],$$HOSTNAME$$-$$SERVICENAME$$-$$SERVICEITEMNAME$$[-672H]) 
>>>>>>
>>>>>>
>>>>>> However, currently, no more than 3 values are available.
>>>>>>
>>>>>> I am already running this on a c3.xlarge machine (4 cores) and 
>>>>>> the load average is quite often > 4, resulting in delays in the 
>>>>>> generation of values. Any pointers on what could be causing the 
>>>>>> high load would be much appreciated.
>>>>>>
>>>>>> On a slightly different note, while using cli.CacheCli, 
>>>>>> retrieving the value of a service item one week back using hours 
>>>>>> (-24H) is considerably faster than retrieving it using minutes 
>>>>>> (-10080M). Again, why does bischeck behave this way?
>>>>>>
>>>>>> Thanks,
>>>>>> Rahul.
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>


-- 

Ingby<http://www.ingby.com>

IngbyForge<http://gforge.ingby.com>

bischeck - dynamic and adaptive thresholds for Nagios <http://www.bischeck.org>

anders.haal at ingby.com<mailto:anders.haal at ingby.com>

Software through engineering creativity and competence

Ingenjörsbyn
Box 531
101 30 Stockholm
Sweden
www.ingby.com <http://www.ingby.com/>
Mobil: +46 70 575 35 46
Tele: +46 75 75 75 090
Fax:  +46 75 75 75 091
