http://qs1969.pair.com?node_id=480100


in reply to Re^2: RFC on Inline::C hack: Hash_Iterator
in thread RFC on Inline::C hack: Hash_Iterator

So basically HeNEXT scoots along a linked list of hash entries (those belonging to a given bucket, to be precise).

Well... you could always use a doubly-linked list to get around this. At the cost of storing a second pointer in each entry, that seems a reasonable trade-off for a more flexible iteration method.

Just on a scalability note, and to raise awareness of a possible stumbling block if this code were ever used as anything other than an iterator: I'm always wary of a hash bucket containing a linked list. In this example there's little to no issue - it's designed as an iterator, so the list will always be traversed from start to end.

For (most) implementations, however, where more random access to hashed elements is required, an iterative lookup is quite inefficient, requiring up to const + N time to find an entry. In such cases it's often better to use something akin to a binary tree, which offers at worst const + log N time to find any entry.

For true flexibility, a linked list threaded through a binary tree has been my structure of choice for a while - it offers a sane way to iterate through the structure while retaining a reasonable random-element-access time.

Nice work - ++tlm.

Re^4: RFC on Inline::C hack: Hash_Iterator
by demerphq (Chancellor) on Aug 02, 2005 at 11:03 UTC

    Well... you could always use a doubly-linked list to get around this. At the cost of storing a second pointer in each entry, that seems a reasonable trade-off for a more flexible iteration method.

    The structures you are talking about are defined by Perl itself. It's not a design decision available to anybody but the pumpkings, and it's unlikely they would accept the cost given the minimal benefits it would provide.

    For (most) implementations, however, where more random access to hashed elements is required, an iterative lookup is quite inefficient, requiring up to const + N time to find an entry. In such cases it's often better to use something akin to a binary tree, which offers at worst const + log N time to find any entry.

    I'm not sure I agree with this analysis. The linked lists used for buckets in Perl's hashes are intended to be extremely small: generally they should hold only one element and, except for degenerate cases, should not really exceed two elements. With this in mind, a binary tree approach makes less sense, as in most cases you would derive no benefit from it at all.

    Perl's hashes don't allow duplicate keys, which means that buckets only hold multiple entries when there are hash-key collisions. Such collisions should be unusual, overly long bucket chains are redistributed to other buckets when a resize event occurs, and IIRC overly long bucket chains are precisely the trigger for such resize events.

    ---
    $world=~s/war/peace/g

      The structures you are talking about are defined by Perl itself.

      I didn't know that, and I find it interesting. The cost involved in pointing back to the previous record is effectively negligible; however, I agree that the benefits are also minimal.

      The linked lists used for buckets in Perl's hashes are intended to be extremely small: generally they should hold only one element and, except for degenerate cases, should not really exceed two elements. With this in mind, a binary tree approach makes less sense, as in most cases you would derive no benefit from it at all.

      Agreed, from a Perl perspective. I should point out that I was aiming to speak more generally with that comment, though I accept that I didn't make that clear. My experience is largely with Other Languages that perhaps don't have such clever internals as Perl, where the underlying data structures can make the difference between terrible and acceptable performance for extremely heavily loaded hash tables.

        My experience is largely with Other Languages that perhaps don't have such clever internals as Perl, where the underlying data structures can make the difference between terrible and acceptable performance for extremely heavily loaded hash tables.

        I'm guessing that you mean scenarios where you have to hand-code your own hash table implementations. I'm also guessing that you mean scenarios where you have to work with statically sized hash tables. In such a scenario I could certainly see your point. A hash table of small threaded binary trees sounds like a good design to me.

        However, just for your edification, I'll outline in general how Perl's hashes work. First, the size of the hash table is always a power of 2, starting at 8 buckets. When the bucket chains start getting too long (determined, I believe, by the ratio of the number of keys to the number of buckets), the size of the hash array is doubled and the keys of the original are remapped into the new array. The hash values are not recalculated: the power-of-two rule means the remapping can be done simply by ANDing a different bit mask with the stored hash values to determine each entry's new slot in the array.

        In normal circumstances each actual key string is stored only once, in a master hash, with the buckets of ordinary hashes pointing to the master hash's buckets (which contain the key string itself). This key sharing is important because hashes are the most common way of representing objects, which by and large tend to have many keys in common.

        ---
        $world=~s/war/peace/g