Skiplists

Andreas Ericsson ae at op5.se
Mon Oct 15 23:55:36 CEST 2007


John P. Rouillard wrote:
> In message <4713932E.1020904 at op5.se>,
> Andreas Ericsson writes:
> 
>> John P. Rouillard wrote:
>>> In message <4712B60C.1080602 at op5.se>,
>>> Andreas Ericsson writes:
>>>
>>>> Andreas Ericsson wrote:
>>>>> Ethan Galstad wrote:
>>>>>> For the event queue, I was thinking that a skip list structure might be
>>>>>> best for efficiency (http://en.wikipedia.org/wiki/Skip_list).  The event
>>>>>> queue is used in primarily two situations:
>>>>>>
>>>>>> 1. Popping events from the head of the list to be executed
>>>>>> 2. Inserting events into the list (mid- or endpoint).
>>>>>>
>>>>>> #1 is very efficient with a linked list, but performance with #2 can be
>>>>>> quite bad in large lists.  Since a check event usually appears for each
>>>>>> host/service that is defined, this can lead to bad performance - O(n^2)
>>>>>> I believe - with large installations.  A skip list would bring the 
>>>>>> performance closer to O(log n).
>>>>>>
>>>>>> Anyone have comments/experiences they care to share about the 
>>>>>> performance of skip lists and/or better alternatives?
>>>>>>
>>>>> A skiplist would probably do wonders. I've been experimenting with one
>>>>> now, actually using the timestamp for when next to execute <action>
>>>> Scratch that. I only skimmed the skiplist page you linked, and
>>>> misunderstood completely. I've been experimenting with a hash where
>>>> the scheduled time is used as the key. Implementation is trivial and
>>>> the performance improvement should be fairly close to, or better
>>>> than, a skip-list's.
>>> Hmm, maybe I am missing something, but if I have an element with a
>>> known key, I see how a hash table pointing into a conventional linked
>>> list will speed up access. But doesn't case 2, which takes the form:
>>>
>>>   insert node X that should run at time t into the linked list
>>>
>>> still mean a sequential scan of the linked list? How do you know what
>>> time to hash to find the node before time t?
>> It's a sequential scan of a small part of the linked list, just like
>> any non-perfect hash. Since perfect hashes aren't available for this,
>> we'll have to use a non-perfect one.
>>>   E.G. I want node X inserted at 10:24:33, how do I know if the prior
>>>        node is at 10:24:32, or 10:14:32?
>>>
>> With 60 buckets in the hash-array, you wouldn't be able to tell the
>> difference without comparing the timestamp of the other scheduled items
>> in the list. With more than 600 buckets, they'd end up in separate lists.
>>
>> However, we won't ever have to walk through *all* the lists; we just
>> have to find the right "end" of things.
>>
>> Assume for a moment that the highest normal_check_interval you have for
>> your services is 15 minutes. 1024 buckets will hold your 15 minutes, with
>> 2.067 minutes to spare. Now let's say you want to schedule a service-check
>> x seconds into the future.
>>
>> slot = (time(NULL) + x) % num_buckets;
>> if (x > num_buckets) {
>> 	walk_from_tail(event_array[slot]);
>> }
>> else {
>> 	walk_from_head(event_array[slot]);
>> }
> 
> Ah, OK, your hash function is the timestamp. So all items scheduled for
> time X are in the same bucket if the number of buckets exceeds the
> number of seconds from now until the maximum future time you have in
> the list.
> 
> This works because the hash function preserves the order of the keys
> rather than randomly distributing them like a normal hash table. I.e.
> 
> key1 = timestamp1
> key2 = timestamp2
> 
> and
> 
>  hash(key1) > hash(key2) if key1 > key2
> 
> so walking the buckets is the same as skipping across the linked list
> in order.
> 
>> Since we're not likely to schedule checks several hours away,
> 
> I have a few checks that run every couple of hours. Would I just have
> a larger hash table, or would you need a separate bin that holds the
> entries that aren't schedulable in the normal hash table and filters
> them in on a regular basis?
> 

One out of three scenarios, basically.

1. You'd have a normal-sized hash-table and take the performance penalty
of sometimes having to walk past those checks to reach the slot where
you want to schedule another.

2. You'd have a hash-table large enough that adding something is always
either tail->next or a new head.

3. You'd implement a new event-type which examines last_check_time and
check_interval and decides whether it's time to check now or time to
re-schedule.

Any of these is easy to implement, with number two making the biggest
memory-for-speed tradeoff, although that one could be done with a
one-way linkage of the list (just like number three), so perhaps it'd
even out for everything but the freakish corner cases where
max_check_interval is several days.

Now that I think of it, number three might be the best way to go, with a
relatively large table to avoid the expensive corner-cases.
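For concreteness, something like this is what I have in mind (an
untested sketch; the names and the "parking" trick are mine, nothing
like this exists in Nagios today). One bucket per second; anything
scheduled beyond the table's window gets parked at the far edge and
re-examined when its bucket comes around, which is scenario three:

#include <stdlib.h>
#include <time.h>

#define NUM_BUCKETS 4096	/* one bucket per second, ~68 minutes */

struct event {
	time_t run_time;		/* absolute time to fire */
	void (*handler)(void *);
	void *arg;
	struct event *next;
};

static struct event *bucket[NUM_BUCKETS];

/* Schedule an event 'delay' seconds from 'now'. Events beyond the
 * window are parked in the farthest bucket and re-inserted when that
 * bucket is drained. */
static void sched_insert(struct event *ev, time_t now, time_t delay)
{
	time_t when = now + delay;

	ev->run_time = now + delay;
	if (delay >= NUM_BUCKETS)
		when = now + NUM_BUCKETS - 1;	/* park at the far edge */

	/* everything in a bucket shares the same second, so ordering
	 * within it doesn't matter: push at the head, O(1) */
	ev->next = bucket[when % NUM_BUCKETS];
	bucket[when % NUM_BUCKETS] = ev;
}

/* Drain the bucket for 'now': run what's due, push parked events
 * whose real run_time is still in the future back out. */
static void sched_run(time_t now)
{
	struct event *ev = bucket[now % NUM_BUCKETS], *next;

	bucket[now % NUM_BUCKETS] = NULL;
	for (; ev; ev = next) {
		next = ev->next;
		if (ev->run_time > now)
			sched_insert(ev, now, ev->run_time - now);
		else
			ev->handler(ev->arg);
	}
}

Insertion is O(1) no matter how big the queue gets; the price is
NUM_BUCKETS pointers of memory and an occasional re-insert for the
far-future events.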

>> and we don't
>> have to walk *past* all items with the same scheduled time as our own event
>> (since the list is going to be sorted based on a non-unique index), the most
>> common case will be that we do only two operations to find our slot.
> 
> Since we can add the new item into the head/tail of the same bucket.
> 

Precisely.

> If the node you are looking for isn't on that list, you go to the node
> that has the closest key preceding the one you want and look to see
> if any of its N-1...1 pointers are assigned. If so, start searching
> along that linked list. At some point you end up searching along the
> 0'th linked list to find the node.
> 

Correct me if I'm wrong, but to me that suggests that the initial search
is made shorter by a high value of N, while a low value of N means fewer
levels to drop through.
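To make the level-dropping concrete, here's the classic search loop as
Pugh describes it (the struct layout is my own sketch, and the fixed
MAX_LEVEL is just a simplification):

#define MAX_LEVEL 16

struct slnode {
	time_t key;
	struct slnode *forward[MAX_LEVEL];	/* forward[i]: next node on level i */
	/* payload ... */
};

/* Find the rightmost node with key < search_key: start on the highest
 * level, run forward until the next step would overshoot, then drop
 * one level and repeat. A high top level gives longer strides early
 * on; a low one gives fewer levels to drop through. */
static struct slnode *sl_find_prev(struct slnode *head, time_t search_key,
                                   int top_level)
{
	struct slnode *x = head;
	int i;

	for (i = top_level; i >= 0; i--) {
		while (x->forward[i] && x->forward[i]->key < search_key)
			x = x->forward[i];
	}
	return x;
}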

> I'm not sure how duplicates are handled, but I would require only the
> first node of a duplicate set of nodes to appear in any linked list
> except the 0 list. So you always have a sublist search if you have
> duplicate keys. Do we have duplicate keys in the current linked list
> code?
> 

Nope. Or perhaps yes. The current linked list is key-less, so it's
kinda hard to say ;-)
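If the keys are the scheduled timestamps, we'd have duplicates all over
the place, but the textbook insert handles them without special-casing
(this is Pugh's standard insertion, not your first-duplicate-only
variant; random_level() is sketched further down):

/* Insert ev. Scanning with <= makes equal keys queue up FIFO behind
 * their twins, which is what a scheduler wants; < would make them
 * LIFO. */
static void sl_insert(struct slnode *head, struct slnode *ev, int top_level)
{
	struct slnode *update[MAX_LEVEL];
	struct slnode *x = head;
	int i, lvl = random_level();	/* height chosen at insert time */

	if (lvl > top_level)
		lvl = top_level;	/* keep the sketch simple */

	for (i = top_level; i >= 0; i--) {
		while (x->forward[i] && x->forward[i]->key <= ev->key)
			x = x->forward[i];
		update[i] = x;	/* last node before ev on level i */
	}
	for (i = 0; i <= lvl; i++) {
		ev->forward[i] = update[i]->forward[i];
		update[i]->forward[i] = ev;
	}
}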


>>> follow the N forward link to node7; you have overshot, so
>>> follow the link back to node4 and follow the node4(N-1) forward link
>>> to node5. Then node5(N-1) leads to node7, which overshoots node6, so
>>> follow the node5(0) link to node6, and you have success.
>>>
>>> Finding node6 takes 5 traversals, which is really not much of a
>>> savings, but with thousands of services I would expect a larger speedup.
>>>
>>> To add a new node (say nodeA) that falls between node4 and node5,
>>> for example, only requires finding the space (3 comparisons,
>>> node1->node4->(node7)->node5) and inserting the new nodeA in the
>>> 0 list.
>>>
>>> To delete a node is the same, but nodes: 1, 3, 4, 5 and 7 are tall
>>> (i.e they exist on multiple lists). So deleting node4 means:
>>>
>>>      node1(N) forward reference points to node4 forward reference
>>>        (i.e. 7)
>>>
>>>      node7(N) reverse reference points to node4's reverse reference
>>>        (i.e. 1)
>>>
>>>      node3(0) forward points to node5
>>>
>>>      node5(0) reverse reference points to node3.
>>>
>>> On all lists except 0, the nodes that get linked in are randomly
>>> determined, and are recalculated by anything that walks the 0 linked list.
>>>
>> Will the performance deteriorate if nothing ever walks that list?
> 
> I claim yes, but how badly, I don't know. It's similar to having an
> imperfect hash and getting more data in a single bucket that you have
> to search.
> 

Ah, right. I think we'd want to avoid doing that at pretty much all
costs. Any known ways around it? Google was remarkably silent on the
subject.
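As far as I can tell, Pugh's original formulation sidesteps this
entirely: a node's height is decided by coin-flips once, when it's
inserted, and never recalculated, so the expected O(log n) search
holds whether or not anything ever walks the level-0 list. Something
like this (p = 1/2 and the cap are the conventional choices, not
requirements):

#include <stdlib.h>

/* Flip coins: promote one more level with probability 1/2, capped at
 * MAX_LEVEL - 1. Called once per node at insertion; nothing is ever
 * rebalanced or recalculated afterwards. */
static int random_level(void)
{
	int lvl = 0;

	while ((rand() & 1) && lvl < MAX_LEVEL - 1)
		lvl++;
	return lvl;
}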

>>> Now we do lose memory from having to maintain pointers for the N...0
>>> lists. I am not sure how many lists have to be kept for M items.  The
>>> skiplist stuff I have done is all in perl using a version of
>>> Algorithm::SkipList which automagically sets the number of linked
>>> lists to use.
>>>
>>> Since in real life you would want to be able to arbitrarily
>>> set/change the number of lists, a real structure might looks like:
>>>
>>>   struct node {
>>> 	struct node **forward; /* malloc'd array of per-level forward pointers */
>>> 	struct node **reverse; /* malloc'd array of per-level reverse pointers */
>>> 	/* data ... */
>>>   };
>>>       
>>> where forward and reverse are malloc'd data areas. So you could tune
>>> the number of levels based on how many entries you had. You could even
>>> reallocate the structures, growing and shrinking the levels as needed, I
>>> think. (Although the bookkeeping for that makes me a bit ill.)
>>>
>> I can imagine. The book-keeping even without that is making me faintly ill.
>>
>> I guess I just don't understand Skiplists properly.
> 
> IIRC skiplists are meant as a replacement for balanced b-tree
> implementations, but without having to do all the rebalancing every
> time a node is entered. Since inserting a new node only affects the
> nodes on either side of it (unlike in a balanced tree, where the whole
> tree may have to be restructured to maintain a constant search time),
> you can have multiple threads updating the list, unlike with some
> other data structures.
> 

Ah, there's that of course, but that'd break down in the odd case where
one thread removes a link after it's been read by another thread,
wouldn't it? Since there's more than one link to and from the object,
I imagine removing it isn't an atomic operation, while reading it is.
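
To spell out why (node layout as in the sketches above; the mutex is my
assumption, not something any implementation mandates): a node that
appears on several levels needs one pointer store per level to unlink,
and a reader that runs between those stores can still wander into the
half-removed node.

#include <pthread.h>

static pthread_mutex_t sl_lock = PTHREAD_MUTEX_INITIALIZER;

/* Unlink ev from every level it appears on. Each level is a separate
 * pointer store, so without the lock a concurrent reader could still
 * follow, say, a level-2 pointer into ev after level 0 has already
 * been patched around it. Lock-free variants exist, but they rely on
 * compare-and-swap and node marking rather than plain stores. */
static void sl_delete(struct slnode *head, struct slnode *ev, int top_level)
{
	struct slnode *x;
	int i;

	pthread_mutex_lock(&sl_lock);
	for (i = top_level; i >= 0; i--) {
		for (x = head; x->forward[i]; x = x->forward[i]) {
			if (x->forward[i] == ev) {
				x->forward[i] = ev->forward[i];
				break;
			}
		}
	}
	pthread_mutex_unlock(&sl_lock);
}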

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
