Skiplists (was: Re: Logging API revamp)

Andreas Ericsson ae at op5.se
Mon Oct 15 18:19:58 CEST 2007


John P. Rouillard wrote:
> In message <4712B60C.1080602 at op5.se>,
> Andreas Ericsson writes:
> 
>> Andreas Ericsson wrote:
>>> Ethan Galstad wrote:
>>>> For the event queue, I was thinking that a skip list structure might be 
>>>> best for efficiency (http://en.wikipedia.org/wiki/Skip_list).  The event 
>>>> queue is used in primarily two situations:
>>>>
>>>> 1. Popping events from the head of the list to be executed
>>>> 2. Inserting events into the list (mid- or endpoint).
>>>>
>>>> #1 is very efficient with a linked list, but performance with #2 can be 
>>>> quite bad in large lists.  Since a check event usually appears for each 
>>>> host/service that is defined, this can lead to bad performance - O(n^2) 
>>>> I believe - with large installations.  A skip list would bring the 
>>>> performance closer to O(log n).
>>>>
>>>> Anyone have comments/experiences they care to share about the 
>>>> performance of skip lists and/or better alternatives?
>>>>
>>> A skiplist would probably do wonders. I've been experimenting with one
>>> now, actually using the timestamp for when next to execute <action>
>> Scratch that. I only glossed the skiplist page you linked, and
>> misunderstood completely. I've been experimenting with a hash, except
>> the time used for input is the key. Implementation is trivial and
>> performance improvement should be fairly close to or better than a
>> skip-list.
> 
> Hmm, maybe I am missing something, but if I have an element with a
> known key, I see how a hash table pointing into a conventional linked
> list will speed up access. But doesn't case 2 take the form:
> 
>   insert node X that should run at time t into the linked list
> 
> mean a sequential scan of the linked list? How do you know what time
> to hash to find the node before time t?

It's a sequential scan of a small part of the linked list, just like
any non-perfect hash. Since perfect hashes aren't available for this,
we'll have to use a non-perfect one.


> 
>   E.G. I want node X inserted at 10:24:33, how do I know if the prior
>        node is at 10:24:32, or 10:14:32?
> 

With 60 buckets in the hash-array, you wouldn't be able to tell the
difference without comparing the timestamps of the other scheduled items
in the list. With more than 600 buckets, they'd end up in separate lists
(your two examples are 600 seconds apart).

However, we won't ever have to walk through *all* the lists, we just have
to find the right "end" of things.

Assume for a moment that the highest normal_check_interval you have for
your services is 15 minutes. 1024 buckets will hold your 15 minutes, with
2.067 minutes to spare. Now let's say you want to schedule a service-check
x seconds into the future.

slot = (time(NULL) + x) % num_buckets;
if (x > num_buckets) {
	/* far-future event; it sorts towards the end of its list */
	walk_from_tail(event_array[slot]);
}
else {
	walk_from_head(event_array[slot]);
}


Since we're not likely to schedule checks several hours into the future,
and we don't have to walk *past* any items sharing our event's scheduled
time (the list is sorted on a non-unique key), the most common case is
that finding our slot takes only two operations: the bucket lookup
above, and one of the comparisons below.

if (walk_from_head) {
	if (head->when >= our->when)
		;	/* bingo, most common case */
} else {
	if (tail->when <= our->when)
		;	/* bingo again, second most common case */
}

Only when we have items scheduled more than (num_buckets*2)+1 seconds into
the future and we want to schedule an event where

	num_buckets < (when - now) < (num_buckets*2)+1

will we end up actually walking *anything*, and since we're more or less
guaranteed to start in the right end when searching (as the scheduling
queue usually gets more and more saturated the closer to "now" we come),
we shouldn't, in practice, ever walk any large amount of objects. Only
scheduled downtime entries should really ever end up in the scheduling
queue with a (when - now) time larger than (num_buckets*2)+1, and those
are much rarer than your average service checks.

One good thing about this is that it's easy to check and get figures
from, and since the implementation is so trivial, we should be able
to have it up and running in no-time.

Memory overhead is low ( sizeof(void *) * num_buckets for the list-head
pointers, twice that if each bucket also keeps a tail pointer ), making
it 4-8KiB larger than today for our example of 1024 buckets on a 32-bit
system.

>> In particular, I'm worried about this snippet:
>>
>> ---%<---%<---%<---
>> Insertions and deletions are implemented much like the corresponding
>> linked-list operations, except that "tall" elements must be inserted
>> into or deleted from more than one linked list.
>> ---%<---%<---%<---
>>
>> I'm sure it'd be good if one gets it completely right, but it sounds
>> non-trivial, and with a large object base (100k list-nodes or more),
>> I'm guessing that "inserted into or delete from more than one
>> linked list" will eat up the performance gains, unless one sacrifices
>> lots of memory to keep a very large number of "buckets", although I
>> fail to see how that would help, since one would then have to walk
>> a larger number of lists in order to find the right one.
> 
> Does this agree with what you are thinking? As I understand skip
> lists, you have a node that looks like:
> 
>      struct node {
>           listN_forward * node;
>           listN_reverse * node;
>           listN-1_forward * node;
>           listN-1_reverse * node;
> 	  ...	
>           list0-forward * node;
>           list0-reverse * node;
>           data...
>      }
> 
> Please exclude the explicit hard coded lists pointers.
> 
> where my "listN" is your bucket (note some (most) implementations use
> only forward and not reverse pointers). So a single node can be on
> multiple lists. Then you have something like:
> 
> 
> 
>          node1  node2   node3  node4  node5  node6  node7
> 
> N          <--------------------><------------------->  
> 
> N-1        <-------------><-------------><----------->
> 
> N-2        
> 
> ...
> 
> 0          <------><-----><------><-----><-----><---->
> 
> So to find node3, you search the N linked list starting a node1, and
> end up at node 4 (by following the ----> forward pointer).

Why? Sorry, but I fail to see why node1[0]->fwd == node4. Hoping to
learn something here, and not arguing :)

> Since
> node4's key is > node 3's key, you realize you have overshot node3, so
> you go back to node1 and search the linked list N-1.
> 
> Following the forward link from node1(N-1) leads you right to node3.
> 
> If you wanted node6, again start at node1 follow the N forward link to
> node4,

So each scheduled event has 2*N*sizeof(ptrdiff_t) overhead? Hmm. I imagine
that'll be quite huge for large datasets.

> follow the N forward link to node7, you have overshot, so
> follow the link back to node 4 and follow the node4(N-1) forward link
> to node5. Then node5((N-1) leads to node7, which overshoots node 6, so
> follow the node5(0) link to node6, and you have success.
> 
> To find node 6 is 5 traversals. Which is really not much of a savings,
> but with thousands of services I would expect a larger speedup.
> 
> To add a new node, (say nodeA) that falls between node4 and node5 for
> example only requires finding the space (3 comparisons,
> node1->node4->(node7)->node5), and inserting in the 0 list the new
> nodeA.
> 
> To delete a node is the same, but nodes: 1, 3, 4, 5 and 7 are tall
> (i.e they exist on multiple lists). So deleting node4 means:
> 
>      node1(N) forward reference points to node4 forward reference
>        (i.e. 7)
> 
>      node7(N) reverse reference points to node4's reverse reference
>        (i.e. 1)
> 
>      node3(0) forward points to node5
> 
>      node5(0) reverse reference points to node node3.
> 
> On all lists except 0, the nodes that get >< are randomly determined,
> and are recalculated on anything that walks the 0 linked list.
> 

Will the performance deteriorate if nothing ever walks that list?

> Now we do lose memory from having to maintain pointers for the N...0
> lists. I am not sure how many lists have to be kept for M items.  The
> skiplist stuff I have done is all in perl using a version of
> Algorithm::SkipList which automagically sets the number of linked
> lists to use.
> 
> Since in real life you would want to be able to arbitrarily
> set/change the number of lists, a real structure might looks like:
> 
>   struct node {
> 	node *forward[]; /* forward is an array of pointers to nodes IIRC*/
> 	node *reverse[]; /* reverse is an array of pointers to nodes IIRC*/
>         data.....
>    }
>       
> where forward and reverse are malloc'd data area's. So you could tune
> the number of levels based on how many entries you had. You could even
> reallocate the structures growing and shrinking the levels as needed I
> think. (Although the bookkeeping for that makes be a bit ill.)
> 

I can imagine. The book-keeping even without that is making me faintly ill.

I guess I just don't understand Skiplists properly. I'll go ahead and
write up my timeslot list and get some numbers from that. In theory it
performs worse, as you'll have large linked lists, but in practice it
shouldn't matter as you'll only add to the endpoints of them.

Ah well. Thanks for the explanation. If nothing else, it convinced me
it's time to go home and sleep ;-)

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231





More information about the Developers mailing list