Skiplists (was: Re: Logging API revamp)

John P. Rouillard rouilj+nagiosdev at cs.umb.edu
Mon Oct 15 19:45:10 CEST 2007


In message <4713932E.1020904 at op5.se>,
Andreas Ericsson writes:

>John P. Rouillard wrote:
>> In message <4712B60C.1080602 at op5.se>,
>> Andreas Ericsson writes:
>> 
>>> Andreas Ericsson wrote:
>>>> Ethan Galstad wrote:
>>>>> For the event queue, I was thinking that a skip list structure might be
>>>>> best for efficiency (http://en.wikipedia.org/wiki/Skip_list).  The event
>>>>> queue is used in primarily two situations:
>>>>>
>>>>> 1. Popping events from the head of the list to be executed
>>>>> 2. Inserting events into the list (mid- or endpoint).
>>>>>
>>>>> #1 is very efficient with a linked list, but performance with #2 can be
>>>>> quite bad in large lists.  Since a check event usually appears for each
>>>>> host/service that is defined, this can lead to bad performance - O(n^2)
>>>>> I believe - with large installations.  A skip list would bring the 
>>>>> performance closer to O(log n).
>>>>>
>>>>> Anyone have comments/experiences they care to share about the 
>>>>> performance of skip lists and/or better alternatives?
>>>>>
>>>> A skiplist would probably do wonders. I've been experimenting with one
>>>> now, actually using the timestamp for when next to execute <action>
>>> Scratch that. I only glossed the skiplist page you linked, and
>>> misunderstood completely. I've been experimenting with a hash, except
>>> the time used for input is the key. Implementation is trivial and
>>> performance improvement should be fairly close to or better than a
>>> skip-list.
>> 
>> Hmm, maybe I am missing something, but if I have an element with a
>> known key, I see how a hash table pointing into a conventional linked
>> list will speed up access. But doesn't case 2 take the form:
>> 
>>   insert node X that should run at time t into the linked list
>> 
>> mean a sequential scan of the linked list? How do you know what time
>> to hash to find the node before time t?
>
>It's a sequential scan of a small part of the linked list, just like
>any non-perfect hash. Since perfect hashes aren't available for this,
>we'll have to use a non-perfect one.
>> 
>>   E.G. I want node X inserted at 10:24:33, how do I know if the prior
>>        node is at 10:24:32, or 10:14:32?
>> 
>
>With a 60 buckets in the hash-array, you wouldn't be able to tell the
>difference without comparing the timestamp of the other scheduled items
>in the list. With more than 600 buckets, they'd end up in separate lists.
>
>However, we won't ever have to walk through *all* the lists, we just have
>to find the right "end" of things.
>
>Assume for a moment that the highest normal_check_interval you have for
>your services is 15 minutes. 1024 buckets will hold your 15 minutes, with
>2.067 minutes to spare. Now let's say you want to schedule a service-check
>x seconds into the future.
>
>slot = time(NULL) + x;
>if (x > num_buckets) {
>	walk_from_tail(event_array[slot]);
>}
>else {
>	walk_from_head(event_array[slot]);
>}

Ah ok, your hash function is the timestamp. So all items scheduled for
time X are in the same bucket, provided the number of buckets exceeds
the number of seconds from now until the maximum future time you have
in the list.

This works because the hash function preserves the order of the keys
rather than randomly distributing them like a normal hash table, i.e.

key1 = timestamp1
key2 = timestamp2

and

 hash(key1) > hash(key2) if key1 > key2

so walking the buckets is the same as skipping across the linked list
in order.
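
For concreteness, something like this is what I picture (just a
sketch; NUM_BUCKETS, sched_event and schedule() are made-up names, not
anything from the Nagios code, and I'm assuming the slot gets taken
modulo the table size, which your snippet leaves implicit):

  #include <time.h>

  #define NUM_BUCKETS 1024

  struct sched_event {
      time_t run_time;            /* absolute time the event should fire */
      struct sched_event *next;   /* events sharing a bucket, kept sorted */
  };

  static struct sched_event *buckets[NUM_BUCKETS];

  /* one bucket per second; within a NUM_BUCKETS-second window this
   * keeps the keys in order, apart from the wrap-around */
  static unsigned int slot_for(time_t when)
  {
      return (unsigned int)(when % NUM_BUCKETS);
  }

  /* schedule an event x seconds into the future: jump straight to its
   * bucket and only walk the (short) per-bucket list */
  static void schedule(struct sched_event *ev, unsigned int x)
  {
      struct sched_event **p;

      ev->run_time = time(NULL) + x;
      p = &buckets[slot_for(ev->run_time)];
      while (*p && (*p)->run_time <= ev->run_time)
          p = &(*p)->next;
      ev->next = *p;
      *p = ev;
  }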

>Since we're not likely to schedule checks several hours away,

I have a few checks that run every couple of hours, so I would either
need a larger hash table, or a separate bin that holds the entries
that aren't schedulable in the normal hash table and filters them in
on a regular basis.
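
Something like this is what I have in mind for the separate bin (again
only a sketch, reusing the made-up names from above): events scheduled
past the bucket window sit on a plain sorted list, and a periodic job
filters them into the bucket array once they come within range.

  /* kept sorted by run_time */
  static struct sched_event *overflow;

  static void schedule_any(struct sched_event *ev, unsigned int x)
  {
      struct sched_event **p = &overflow;

      if (x < NUM_BUCKETS) {
          schedule(ev, x);        /* fits in the bucket window */
          return;
      }

      /* too far out: park it on the sorted overflow list instead */
      ev->run_time = time(NULL) + x;
      while (*p && (*p)->run_time <= ev->run_time)
          p = &(*p)->next;
      ev->next = *p;
      *p = ev;
  }

  /* run every so often (once a minute, say) to move overflow entries
   * into the bucket array as they become schedulable */
  static void refill_from_overflow(void)
  {
      time_t now = time(NULL);

      while (overflow && overflow->run_time < now + NUM_BUCKETS) {
          struct sched_event *ev = overflow;
          overflow = ev->next;
          schedule(ev, ev->run_time > now ?
                       (unsigned int)(ev->run_time - now) : 0);
      }
  }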

>and we don't
>have to walk *past* all items with the same scheduled time as our own event
>(since the list is going to be sorted based on a non-unique index), the most
>common case will be that we do only two operations to find our slot.

Right, since we can just add the new item at the head or tail of the
same bucket.

>>> In particular, I'm worried about this snippet:
>>>
>>> ---%<---%<---%<---
>>> Insertions and deletions are implemented much like the corresponding
>>> linked-list operations, except that "tall" elements must be inserted
>>> into or deleted from more than one linked list.
>>> ---%<---%<---%<---
>>>
>>> I'm sure it'd be good if one gets it completely right, but it sounds
>>> non-trivial, and with a large object base (100k list-nodes or more),
>>> I'm guessing that "inserted into or delete from more than one
>>> linked list" will eat up the performance gains, unless one sacrifices
>>> lots of memory to keep a very large number of "buckets", although I
>>> fail to see how that would help, since one would then have to walk
>>> a larger number of lists in order to find the right one.
>> 
>> Does this agree with what you are thinking? As I understand skip
>> lists, you have a node that looks like:
>> 
>>      struct node {
>>           listN_forward * node;
>>           listN_reverse * node;
>>           listN-1_forward * node;
>>           listN-1_reverse * node;
>> 	  ...	
>>           list0-forward * node;
>>           list0-reverse * node;
>>           data...
>>      }
>> 
>> Please exclude the explicit hard coded lists pointers.
>> 
>> where my "listN" is your bucket (note some (most) implementations use
>> only forward and not reverse pointers). So a single node can be on
>> multiple lists. Then you have something like:
>> 
>> 
>> 
>>          node1  node2   node3  node4  node5  node6  node7
>> 
>> N          <--------------------><------------------->  
>> 
>> N-1        <-------------><-------------><----------->
>> 
>> N-2        
>> 
>> ...
>> 
>> 0          <------><-----><------><-----><-----><---->
>> 
>> So to find node3, you search the N linked list starting a node1, and
>> end up at node 4 (by following the ----> forward pointer).
>
>Why? Sorry, but I fail to see why node1[0]->fwd == node4. Hoping to
>learn something here, and not arguing :)

Sorry, make that follow node1[N]->forward to node4; in skiplists you
always start searching at the highest linked list (N in this case).
If the node you are looking for isn't on that list, you go to the node
that has the closest key preceding the one you want and look to see
if any of its N-1...1 pointers are assigned. If so, you start searching
along that linked list. At some point you end up searching along the
0'th linked list to find the node.
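
In code the search I'm describing would look roughly like this
(forward pointers only; the struct and field names are made up for
illustration, not lifted from any particular implementation):

  #include <time.h>

  struct skipnode {
      time_t key;
      int levels;                  /* how many lists this node is on */
      struct skipnode **forward;   /* forward[i] = next node on list i */
      void *data;
  };

  struct skiplist {
      int max_level;               /* number of lists in use */
      struct skipnode *head;       /* dummy head node, on every list */
  };

  static struct skipnode *skiplist_find(struct skiplist *sl, time_t key)
  {
      struct skipnode *n = sl->head;
      int i;

      /* start on the highest list and drop a level whenever the next
       * forward pointer would overshoot the key we want */
      for (i = sl->max_level - 1; i >= 0; i--) {
          while (n->forward[i] && n->forward[i]->key < key)
              n = n->forward[i];
      }

      /* n is now the rightmost node with a key below ours; its
       * successor on list 0, if any, is the first candidate match */
      n = n->forward[0];
      return (n && n->key == key) ? n : NULL;
  }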

I'm not sure how duplicates are handled, but I would require only the
first node of a duplicate set of nodes to appear in any linked list
except the 0 list. So you always have a sublist search if you have
duplicate keys. Do we have duplicate keys in the current linked list
code?

>> Since
>> node4's key is > node 3's key, you realize you have overshot node3, so
>> you go back to node1 and search the linked list N-1.
>> 
>> Following the forward link from node1(N-1) leads you right to node3.
>> 
>> If you wanted node6, again start at node1 follow the N forward link to
>> node4,
>
>So each scheduled event has 2*N*sizeof(ptrdiff_t) overhead? Hmm. I imagine
>that'll be quite huge for large datasets.

Or just N*sizeof(ptrdiff_t); the reverse links aren't required. I am
not sure if the current linked list implementation is bi-directional
or not. But what's a factor of 2 among friends 8-).

>> follow the N forward link to node7, you have overshot, so
>> follow the link back to node 4 and follow the node4(N-1) forward link
>> to node5. Then node5((N-1) leads to node7, which overshoots node 6, so
>> follow the node5(0) link to node6, and you have success.
>> 
>> To find node 6 is 5 traversals. Which is really not much of a savings,
>> but with thousands of services I would expect a larger speedup.
>> 
>> To add a new node, (say nodeA) that falls between node4 and node5 for
>> example only requires finding the space (3 comparisons,
>> node1->node4->(node7)->node5), and inserting in the 0 list the new
>> nodeA.
>> 
>> To delete a node is the same, but nodes: 1, 3, 4, 5 and 7 are tall
>> (i.e they exist on multiple lists). So deleting node4 means:
>> 
>>      node1(N) forward reference points to node4 forward reference
>>        (i.e. 7)
>> 
>>      node7(N) reverse reference points to node4's reverse reference
>>        (i.e. 1)
>> 
>>      node3(0) forward points to node5
>> 
>>      node5(0) reverse reference points to node node3.
>> 
>> On all lists except 0, the nodes that get >< are randomly determined,
>> and are recalculated on anything that walks the 0 linked list.
>> 
>
>Will the performance deteriorate if nothing ever walks that list?

I claim yes, but how badly, I don't know. It's similar to having an
imperfect hash and ending up with more data in a single bucket that
you then have to search sequentially.

When you recalculate, it's similar to dynamically changing your hash
algorithm to better distribute the items across the buckets and
rebuilding your hash table.

>> Now we do lose memory from having to maintain pointers for the N...0
>> lists. I am not sure how many lists have to be kept for M items.  The
>> skiplist stuff I have done is all in perl using a version of
>> Algorithm::SkipList which automagically sets the number of linked
>> lists to use.
>> 
>> Since in real life you would want to be able to arbitrarily
>> set/change the number of lists, a real structure might looks like:
>> 
>>   struct node {
>> 	node *forward[]; /* forward is an array of pointers to nodes IIRC*/
>> 	node *reverse[]; /* reverse is an array of pointers to nodes IIRC*/
>>         data.....
>>    }
>>       
>> where forward and reverse are malloc'd data area's. So you could tune
>> the number of levels based on how many entries you had. You could even
>> reallocate the structures growing and shrinking the levels as needed I
>> think. (Although the bookkeeping for that makes be a bit ill.)
>> 
>
>I can imagine. The book-keeping even without that is making me fainly ill.
>
>I guess I just don't understand Skiplists properly.

IIRC skiplists are meant to be a replacement for a balanced b-tree
implementation, but without having to do all the rebalancing every
time a node is inserted. Since inserting a new node only affects the
nodes on either side of it (unlike in a balanced tree, where the whole
tree may have to be restructured to keep its search time bounded), you
can also have multiple threads updating the list more easily than with
some other data structures.
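
To make that concrete, insertion looks roughly like this (reusing the
made-up skipnode/skiplist types from the search sketch earlier; error
handling omitted, so treat it as a sketch rather than a finished
implementation):

  #include <stdlib.h>

  #define MAX_LEVEL 16    /* arbitrary upper bound for this sketch */

  static int random_level(void)
  {
      int lvl = 1;
      /* coin flips: a node reaches level k with probability ~2^-k */
      while (lvl < MAX_LEVEL && (rand() & 1))
          lvl++;
      return lvl;
  }

  static struct skipnode *skiplist_insert(struct skiplist *sl, time_t key)
  {
      struct skipnode *update[MAX_LEVEL];
      struct skipnode *n = sl->head;
      struct skipnode *nn;
      int i, lvl;

      /* same walk as the search, but remember the predecessor of the
       * new node on every level */
      for (i = sl->max_level - 1; i >= 0; i--) {
          while (n->forward[i] && n->forward[i]->key < key)
              n = n->forward[i];
          update[i] = n;
      }

      lvl = random_level();
      if (lvl > sl->max_level)
          lvl = sl->max_level;    /* keep it simple: never grow the list */

      nn = calloc(1, sizeof(*nn));
      nn->key = key;
      nn->levels = lvl;
      nn->forward = calloc(lvl, sizeof(*nn->forward));

      /* splice the new node in after its predecessor on each of its
       * levels -- no other node in the list is touched */
      for (i = 0; i < lvl; i++) {
          nn->forward[i] = update[i]->forward[i];
          update[i]->forward[i] = nn;
      }
      return nn;
  }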

				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.

