
Have you done any benchmarking experiments comparing hash table implementations? You know: experiments with lots of unique additions; experiments that involve lots of collisions (where the hash key already exists); deletion operations; conditional deletion operations; etc.
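For what it's worth, here's a minimal sketch of that kind of experiment using the core Benchmark module. The key count and key shapes are arbitrary, and the %h setup is (crudely) included in each timed sub; a fairer harness would hoist it out:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    my @keys = map { "key$_" } 1 .. 100_000;

    cmpthese( -3, {
        insert => sub {                       # lots of unique additions
            my %h;
            $h{$_} = 1 for @keys;
        },
        update => sub {                       # every key already exists
            my %h = map { $_ => 1 } @keys;
            $h{$_}++ for @keys;
        },
        delete => sub {                       # deletions
            my %h = map { $_ => 1 } @keys;
            delete $h{$_} for @keys;
        },
        cond_delete => sub {                  # conditional deletions
            my %h = map { $_ => 1 } @keys;
            exists $h{$_} and delete $h{$_} for @keys;
        },
    } );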

In general, any decent hash table implementation worthy of the name is effectively (amortised) O(1) for all those operations. But of course, the devil is in the detail. As a general-purpose implementation (anything you like as a key, anything you like as a payload), Perl's is (probably) about as good as it gets. Its main downside is that its performance and generality come at the cost of substantial memory usage. A worthwhile trade when speed and generality are the major criteria.

Can Perl's hash implementation be beaten? Could it be optimised?

It absolutely depends upon the usage you are making of them, and on what criteria you are trying to optimise.

My current effort has three aims: 1) to trade (a little) speed for a significant memory reduction, by shrinking the internal structures used; 2) to further reduce memory consumption by restricting keys to 255 characters, thereby allowing me to get away with a single byte for the key length; 3) to avoid the need to use memory and CPU to sort the keys, by implementing the hash such that it supports in-order traversal of the keys. All whilst retaining the generality of keys containing any byte values and supporting any SV as a payload. I'm currently bogged down in the typical (for me) XS memory and refcount management issues.
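By way of illustration only (this is not the XS code itself, and the key is made up), pack's 'C/a*' template expresses exactly that single-byte length prefix, and shows where the 255-character limit comes from:

    use strict;
    use warnings;

    # 'C/a*' packs a 1-byte length followed by the key bytes,
    # which is precisely why keys must be <= 255 bytes.
    my $key    = "any\0bytes\xFFallowed";     # any byte values are fine
    my $packed = pack 'C/a*', $key;           # [len][key bytes]

    my $len = unpack 'C',    $packed;         # just the length byte
    my $out = unpack 'C/a*', $packed;         # length-guided extraction
    printf "len=%d roundtrip=%s\n", $len, $out eq $key ? 'ok' : 'FAIL';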

What about an on-disk hash table using MLDBM? Could it be made to run any faster? How does it compare with using a database equivalent (e.g. MySQL)?

I don't think performance was a high priority for the implementation of MLDBM. I think they were aiming for a) basic CRUD behaviour (no referential constraints or normal-form requirements); b) nested, hash-like ease of use (avoiding SQL complexities and bottlenecks); c) ACID reliability in a multi-user environment, with single-file simplicity.
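For reference, the usual access pattern looks like this (file name illustrative). Note the whole-record serialise/deserialise round trip on every store and fetch, which is a big part of why raw speed was never on the cards:

    use strict;
    use warnings;
    use MLDBM qw(DB_File Storable);    # DBM back end + serialiser
    use Fcntl qw(O_CREAT O_RDWR);

    tie my %db, 'MLDBM', 'data.dbm', O_CREAT | O_RDWR, 0644
        or die "tie failed: $!";

    # Every store serialises the entire top-level value ...
    $db{user42} = { name => 'alice', scores => [ 80, 93 ] };

    # ... and every fetch deserialises a copy, so nested updates
    # need a fetch-modify-store round trip:
    my $rec = $db{user42};
    push @{ $rec->{scores} }, 99;
    $db{user42} = $rec;                # store the whole record back

    untie %db;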

And I think they've got pretty close to those goals. To target speed would be to compromise their other goals.

The bottom line is that it is nearly always possible to beat out an RDBMS for performance when you have specific knowledge of the data and usage requirements. But doing it requires custom effort for each specific set of requirements. RDBMSs (and Perl's hashes) are aimed at performing well across the widest possible range of general requirements.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy