Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
I have done preliminary research into various types of binary trees, and none of them seems to fit all my requirements without modification. Since raw speed is also a necessity, anything I prototype in Perl will ultimately be ported to C. Here are my requirements:

Amazingly enough, Perl arrays fit all of these requirements except that splice is O(N). Balanced trees are faster but are also more complex and wouldn't meet some of the other requirements without some modification.
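
To make that concrete, here is a minimal sketch of the sorted-array approach (the helper name find_pos is mine): a binary search finds the position, then splice inserts or deletes there. The find is O(log N); only the splice itself is O(N).

```perl
use strict;
use warnings;

# Keep the array sorted; binary-search for a key's position, then
# splice to insert or delete at that position.
sub find_pos {                     # index of the first element >= $key
    my ($aref, $key) = @_;
    my ($lo, $hi) = (0, scalar @$aref);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($aref->[$mid] < $key) { $lo = $mid + 1 }
        else                      { $hi = $mid }
    }
    return $lo;
}

my @sorted = (10, 20, 30, 40);
splice @sorted, find_pos(\@sorted, 25), 0, 25;   # insert 25
splice @sorted, find_pos(\@sorted, 30), 1;       # delete 30
print "@sorted\n";                               # 10 20 25 40
```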

I also have a derived requirement. I will be finding two items using separate binary searches. If the distance between the two items fits certain criteria, I will traverse between them. I do not want to have to perform the first binary search again, so I would like to remember where it ended and start there.

What I am looking for is a specific suggestion on a data structure that you are confident can be modified to fit all my requirements. As I indicated above, I have done surface research on various data structures and none fits the bill out of the box. They are complicated enough that I don't want to spend my time unraveling their mysteries unless I am confident they are worth the effort.

One final note - the number of items will never exceed 1000.

Cheers - L~R

Replies are listed 'Best First'.
Re: Advanced Data Structure Question
by jhourcle (Prior) on Apr 19, 2006 at 15:01 UTC

    All of the items qualify as requirements, except for the comment about 'fast' -- that's just a guideline. You need to determine what minimum performance is acceptable (i.e., it can perform all of the operations you listed, and within whatever time you decide is acceptable).

    O(N) isn't the greatest, but if that's your only issue, it still might be the best implementation, especially when you're dealing with small (<1000) arrays. (small being a relative thing ... which is why we need hard numbers)

    You already have your main considerations for decision making -- now decide on what the acceptable times are, and/or relative scoring. For example, you might decide that it has to be able to run some given test within (x) sec -- and the test might run just the add items, or the remove items, or it might run a combination of traversal, add/removal, searching, distance determination, etc.
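
    For instance, a rough Benchmark-based harness along these lines would produce those hard numbers (the two strategies compared here are just placeholders):

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Placeholder harness: time two candidate insert strategies at the
# problem's actual size (n <= 1000) and compare their rates.
my @base = map { $_ * 2 } 1 .. 1000;

cmpthese(500, {
    splice_insert => sub {
        my @a = @base;
        splice @a, 500, 0, 999;              # O(n) mid-array insert
    },
    push_and_sort => sub {
        my @a = @base;
        push @a, 999;
        @a = sort { $a <=> $b } @a;          # naive re-sort after append
    },
});
```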

    I'd also suggest forgetting about your derived requirement -- just consider it in terms of speed. It's possible that an implementation that requires an extra search will be faster than a bad implementation that can 'remember' the position. (and if it's a condition that doesn't happen frequently, the speed savings for not remembering might make up for the extra searches)

Re: Advanced Data Structure Question
by Zaxo (Archbishop) on Apr 19, 2006 at 14:19 UTC

    I think a plain old binary tree should do fine with this.

    Since you have a limited number of elements, asymptotic performance is not as important as you think. Fast enough will stay fast enough if the data size can't grow.

    After Compline,
    Zaxo

      Zaxo,
      What exactly is a plain old binary tree? I ask because I am using Mastering Algorithms With Perl, and it suggests that a plain old binary tree is only balanced on initialization; after enough modifications it may be no more efficient than a linked list. Perhaps this is just the book's way of motivating the various balancing methods by demonstrating a need, and, in reality, all binary trees are assumed to have inherent balancing.

      Cheers - L~R

Re: Advanced Data Structure Question
by traveler (Parson) on Apr 19, 2006 at 14:39 UTC
    I seem to recall from my compsci classes some years ago :-) that you generally trade insertion speed for access speed. That said, it seems that your basic data structure should perhaps be a "traditional" (not Perl) hash: buckets that hold multiple values, plus a hash function. You could then make the buckets arrays, trees, or deques depending on what your incoming data looks like. With the small size, arrays are likely to make good buckets; if you make 1000 buckets, there will be few collisions. You should also store the size of each bucket in order to make the distance computations fast.

    Why is "binary search" important for find? The hash, of course, does not provide that at the highest level, but could within the buckets. The hash is likely to be faster than the b-search.
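
    A minimal sketch of such a bucketed hash (the prime bucket count, the trivial mod function, and the helper names are my own assumptions):

```perl
use strict;
use warnings;

# N buckets, each a small sorted array; keys land in bucket (key % N).
# In C you would also store each bucket's size; Perl tracks it for us.
my $N = 1009;                                    # assumed prime bucket count
my @bucket = map { [] } 1 .. $N;

sub h_insert {
    my ($key) = @_;
    my $b = $bucket[$key % $N];
    my $i = 0;
    $i++ while $i < @$b && $b->[$i] < $key;      # keep the bucket sorted
    splice @$b, $i, 0, $key;
}

sub h_find {                                     # 1 if present, 0 if not
    my ($key) = @_;
    return 0 + grep { $_ == $key } @{ $bucket[$key % $N] };
}

h_insert($_) for 5, 1014, 42;                    # 5 and 1014 collide (1014 % 1009 == 5)
print h_find(1014), ' ', h_find(7), "\n";        # 1 0
```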

    HTH, --traveler

      traveler,
      Given an ordered list, a binary search is the fastest way to find an element. I didn't realize there was a way to use a hashing function and preserve order as well. I thought, apparently incorrectly, that those two things were in conflict with one another. I guess I will look into hashing functions further if you feel they could do the job.

      Cheers - L~R

        Given any list, a hashed search is the fastest way to find an element because it's generally O(1) (unlike O(log n) for binary searches), particularly if you have a limit on the number of elements as you do. You can use 1001 as your modulus and (with a little finagling) guarantee uniqueness in your buckets.

        If you need to maintain order, do so by another data structure. In almost all cases, you can trade RAM for CPU.


        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

        There are ordered hash tables. Knuth did some work with them. You can also keep the order information in a parallel array.
Re: Advanced Data Structure Question
by Limbic~Region (Chancellor) on Apr 19, 2006 at 14:10 UTC
    All,
    Since I am still a C neophyte and an array fits so many of my requirements, I am thinking that it may be the way to go. I don't have to do anything for the Perl prototype and the C implementation doesn't require rocket science.
    • shift/pop are a simple matter of changing the logical end points
    • unshift/push can be optimized if there is room between logical and real end points
    • splice will require shifting values between the insertion point and one end, perhaps via memmove (memcpy is unsafe on overlapping regions)
    It just isn't blazingly fast in comparison to binary trees.

    Update:
    I discussed this with ambrus further in the CB. There are optimizations available that wouldn't be otherwise for a general solution that I can capitalize on. I will start out with a large real array and place my logical array in the middle.
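
    That plan can be sketched like this (the capacity and helper names are assumptions; in C, @real becomes a plain block of memory and $lo/$hi plain indices):

```perl
use strict;
use warnings;

# A large real array with the logical array floating in the middle:
# shift/unshift/push/pop all become end-pointer moves, no copying.
my $CAP = 4096;                                  # assumed capacity, well above 1000
my @real = (undef) x $CAP;
my ($lo, $hi) = (int($CAP / 2), int($CAP / 2) - 1);   # empty logical array

sub d_push    { $real[++$hi] = $_[0] }
sub d_unshift { $real[--$lo] = $_[0] }
sub d_pop     { $real[$hi--] }
sub d_shift   { $real[$lo++] }
sub contents  { @real[$lo .. $hi] }

d_push($_) for 2, 3;
d_unshift(1);
print join(' ', contents()), "\n";               # 1 2 3
```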

    Cheers - L~R

Re: Advanced Data Structure Question
by cdarke (Prior) on Apr 19, 2006 at 14:12 UTC
    "anything I prototype in Perl will ultimately be ported to C"
    Perl and C arrays are quite different, and algorithms written for Perl will not necessarily perform well in other languages; this is particularly true of C, where the performance trade-offs differ.
    You might find it quicker to write in C in the first place, otherwise you might find yourself writing Perl array handling in C, again.
      cdarke,
      I am confused - how does skipping a Perl prototype avoid duplicating Perl array handling in C? Perl's arrays are really efficient at certain operations, and those are the same operations I am going to need in C if I decide to use an array as my data structure. The point of the Perl prototype is so that I can actually see what I need to do. Besides, there is more to the project than just this data structure.

      Cheers - L~R

        My point is that you can't do the same thing with C arrays as with Perl arrays. Of course, Perl is written in C, so anything you can do in Perl can be done in C, except there may be a more efficient way of handling things in C.
Re: Advanced Data Structure Question
by blokhead (Monsignor) on Apr 19, 2006 at 16:28 UTC
    I didn't have time to post this right away this morning, but here's at least some thoughts about efficiently measuring the "distance" between two nodes. Now I see you will probably go in a simpler direction, but I'm posting this anyway in case you (or anyone else) feel the urge to implement such a structure.

    Once you have a suitable balanced binary tree, you can augment it with threading to get a fast inorder traversal starting from any point. You can also augment it in the following way so that given two keys, you can find them in the tree in the normal way, but at the same time compute the "distance" between them at no extra cost.

    Add two slots to each node of the tree, storing the number of left descendants and right descendants. When inserting, you can do the increments while traversing the tree to find the insertion spot, just before following a child pointer. When deleting, you decrement the appropriate values all the way down to the node. If your balanced tree implementation requires e.g. rotations every once in a while, the left/right descendant counts are easy to maintain through those operations as well.

    Now, using this extra bookkeeping, here's how you can get pointers to X and Y (X < Y) and at the same time get the number of nodes "between" X and Y:

    • In parallel, keep track of two node pointers, PX and PY, and traverse down the tree to find X and Y respectively.
    • PX and PY both start at the root. At the first point in time where they diverge (PX = PY but PX is about to turn left and PY is about to turn right), set N=1
    • Thereafter, every time PX is about to turn left, add (1+PX.right_descendants) to N. And every time PY is about to turn right, add (1+PY.left_descendants) to N.
    • When PX and PY finally reach X and Y respectively, also add X.right_descendants and Y.left_descendants (those subtrees fall strictly between the two keys as well); N is then the number of nodes between X and Y in the ordering.
    Alternatively, if you already have pointers to X & Y, you could do this "in reverse", working back up towards the root. But I found it simpler to describe working from the top down. Anyway, this gives a O(log n) algorithm for their "distance".
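
    Here is a sketch of that bookkeeping (balancing omitted), with the distance phrased as the difference of two ranks computed from the same left/right descendant counts -- one O(log n) descent per key, the same cost as the parallel walk described above:

```perl
use strict;
use warnings;

# Each node: [ key, left, right, lcount, rcount ], where lcount/rcount
# hold the number of left/right descendants.  (No balancing here.)
sub insert {
    my ($node, $key) = @_;
    return [ $key, undef, undef, 0, 0 ] unless $node;
    if ($key < $node->[0]) {
        $node->[3]++;                            # one more left descendant
        $node->[1] = insert($node->[1], $key);
    }
    elsif ($key > $node->[0]) {
        $node->[4]++;                            # one more right descendant
        $node->[2] = insert($node->[2], $key);
    }
    return $node;
}

# Number of keys in the tree smaller than $key -- a single descent,
# paid for by the stored counts.
sub rank_of {
    my ($node, $key) = @_;
    my $r = 0;
    while ($node) {
        if    ($key < $node->[0]) { $node = $node->[1] }
        elsif ($key > $node->[0]) { $r += $node->[3] + 1; $node = $node->[2] }
        else                      { return $r + $node->[3] }
    }
    return $r;
}

# Nodes strictly between $x and $y (assumes $x < $y, both present).
sub between {
    my ($root, $x, $y) = @_;
    return rank_of($root, $y) - rank_of($root, $x) - 1;
}

my $root;
$root = insert($root, $_) for 4, 2, 6, 1, 3, 5, 7;
print between($root, 1, 7), "\n";                # 5 (keys 2 .. 6)
```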

    blokhead

Re: Advanced Data Structure Question
by dokkeldepper (Friar) on Apr 19, 2006 at 15:21 UTC

    Hi,

    Use an array to keep the values and create a B-tree to do the search; however, instead of the values, keep in the tree nodes the indices of the array elements you search for.

    The hardest thing here is to keep the array in sync with the tree.

    Basically it is something like an indexed array.

    update: Uaaa, I missed the line with "never more than 1000 elements". Then my proposal is total overkill. Isn't there a chapter in PBP about benchmarking?

Re: Advanced Data Structure Question
by samtregar (Abbot) on Apr 19, 2006 at 18:22 UTC
    Are you sure the naive approach is too slow? Have you Benchmarked it?

    Sure, splice is O(n), but if n=1000 that's still pretty fast! Implement it in C if you want, but I'm not sure you'll gain much. I think a tree is likely to be overkill here.

    -sam

Re: Advanced Data Structure Question
by sgifford (Prior) on Apr 19, 2006 at 15:44 UTC
    DB_File's $DB_BTREE database should be a quick way to get a binary tree up and running. You can create the database in-memory, which is likely to be pretty fast.
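
    A minimal example (undef in place of a filename keeps the database in memory; the compare callback is needed because the default key ordering is string-wise):

```perl
use strict;
use warnings;
use DB_File;
use Fcntl;

# In-memory BTREE: pass undef instead of a filename.  Install a
# numeric comparison, since the default ordering is string-wise.
$DB_BTREE->{compare} = sub { $_[0] <=> $_[1] };

tie my %t, 'DB_File', undef, O_RDWR | O_CREAT, 0666, $DB_BTREE
    or die "cannot tie: $!";

$t{$_} = 1 for 30, 5, 100, 42;
print join(' ', keys %t), "\n";                  # 5 30 42 100, in key order
```
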
Re: Advanced Data Structure Question
by eric256 (Parson) on Apr 19, 2006 at 16:48 UTC

    I'm missing why any tree implementation doesn't fit your needs. They can be traversed in order (forwards or backwards), they are quick to add to and remove from, they are quick to search, and distance should just be a traversal counting nodes along the way (I think). It is hard to say without any idea of what data you are storing and whether you are talking about the data itself being in order or some form of index.


    ___________
    Eric Hodges
Re: Advanced Data Structure Question
by TedPride (Priest) on Apr 19, 2006 at 17:59 UTC
    I'm sort of wondering too why trees won't work. A regular tree can have problems given worst-case input, but a 2-3-4 tree fits all of the requirements reasonably well. Depending on your needs, you may also want to look into a regular binary tree with insert at root, which while it does not defend against worst-case input, does keep the most recently added items closest to the root, where they can be accessed most quickly.

    Or you could just use a regular binary tree, but rebalance it every x number of items. Assuming you do the rebalancing efficiently (and I'm doing my calculations correctly), rebalancing should be an O(n) process, meaning you can run it every so often without too big a loss in efficiency.
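
    That periodic rebalance can be sketched as an inorder flatten followed by a rebuild around the median (O(n) overall; the node layout is my own):

```perl
use strict;
use warnings;

# Nodes are [ key, left, right ].  Flatten with an inorder walk,
# then rebuild a perfectly balanced tree around the median.
sub flatten {
    my ($n) = @_;
    return () unless $n;
    return (flatten($n->[1]), $n->[0], flatten($n->[2]));
}

sub rebuild {
    my @keys = @_;
    return undef unless @keys;
    my $mid = int(@keys / 2);
    return [ $keys[$mid],
             rebuild(@keys[0 .. $mid - 1]),
             rebuild(@keys[$mid + 1 .. $#keys]) ];
}

# A worst-case, linked-list-shaped tree holding 1 .. 5:
my $t = [1, undef, [2, undef, [3, undef, [4, undef, [5, undef, undef]]]]];
$t = rebuild(flatten($t));
print $t->[0], "\n";                             # 3 -- the median is the new root
```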

    Bottom line, you're making a mountain out of a molehill if it's only 1000 items. You can use arrays and nobody will be able to tell the difference in your prototype. Later on, I'm sure there are 2-3-4 or red/black tree classes for C that you can use with minimal trouble, or you can borrow any algorithms book and write your own class in a few hours.

      Or splay trees, for example. These make inserting and searching very fast if you access the same datum shortly after a previous access. If the accesses are truly random, you might check out a self-balancing tree such as an AVL tree.

      --traveler

Re: Advanced Data Structure Question
by educated_foo (Vicar) on Apr 19, 2006 at 15:40 UTC
    STL sets and iterators, with a custom comparison function that calls back into Perl, satisfy your requirements if you can use C++. They're probably red-black trees under the hood (they are in libg++).
Re: Advanced Data Structure Question
by spiritway (Vicar) on Apr 20, 2006 at 02:21 UTC
    One final note - the number of items will never exceed 1000.

    I can't tell you how many times I've been bitten by things like this... the check is in the mail ;-)

      spiritway,
      This is not a general purpose solution. It is addressing a specific mathematical problem that by its definition can't exceed 1000 items. Normally, when I depend on a bound I have no control over, I stipulate that it should not be exceeded and that I want to know about it if it ever is.

      Cheers - L~R

        Well, I meant no harm or insult in my comment. I've just found that when I write something for a very limited purpose, it often happens that I eventually come to regret making it so limited. Don't let me spoil the fun, though - if your example won't go beyond 1000 elements, then that's a useful bit of information. Good luck.

Re: Advanced Data Structure Question
by BrowserUk (Patriarch) on Apr 20, 2006 at 06:58 UTC

    Will all your values be unique?

    Update: Ignore this dumb question. You talked of using a binary search.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Advanced Data Structure Question
by Scott7477 (Chaplain) on Apr 20, 2006 at 05:33 UTC
    Are there any languages besides C that might give the performance improvement that Limbic~Region is looking for here?
    Perhaps Fortran (I've never used it personally)? A quick Google search revealed this link, the abstract of which indicates that Java could potentially be only 30% slower than Fortran or C for some types of numerical calculations.

    It just seems strange to me that C seems to have been the default science/math language for so long. People are spitting out dynamically typed languages seemingly left and right (JavaScript, VBScript, Python, Perl). I've never written a programming language, but for a group of folks with the right background would creating an alternative to C require much greater effort than that which has been put into the development of Perl?