jc has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I'm trying to research the internals of Perl's hash tables. Does anyone know how hash tables are implemented in Perl? Are these wrappers around some kind of C hash table implementation? Does anybody know where in the source tree I can find the files that implement Perl's hash tables?
Thanks if anyone can answer.
Re: Perl's hash table implementation
by BrowserUk (Patriarch) on Oct 07, 2010 at 08:31 UTC
The clearest picture you'll see of how Perl's hashes are implemented is this one. For a little associated description, see PerlGuts Illustrated and scan down to the section entitled "HV". If you have specific questions about the details, I'd be willing to try to answer them, with the proviso that I'm just an "external observer" who happens to be writing an XS implementation of a hash-like mechanism at the moment and is leaning heavily upon the perl hash code as a reference. I've basically become very familiar with it over the last two or three weeks.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
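As a concrete illustration of the HV fields that PerlGuts Illustrated describes, the core Devel::Peek module will print them for any hash from ordinary Perl code. A minimal sketch; the hash contents here are arbitrary:

#!/usr/bin/perl
# Dump the C-level structure (HV) behind an ordinary Perl hash.
# Devel::Peek has been core since 5.6; output details vary by perl version,
# but fields such as KEYS, FILL and MAX (bucket array size minus one), plus
# the stored HASH value of each entry, are shown.
use strict;
use warnings;
use Devel::Peek qw(Dump);

my %h = ( foo => 1, bar => 2, baz => 3 );

# The second argument raises the default limit on how many entries are dumped.
Dump( \%h, 10 );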
by Anonymous Monk on Oct 07, 2010 at 08:47 UTC
by BrowserUk (Patriarch) on Oct 07, 2010 at 09:01 UTC
Oh! That's nice++ I especially like the "click to see others". Very cool. I've replaced my bookmark. Thank you so much for that (the link, and the effort of updating the source, if you are the one responsible).
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
by jc (Acolyte) on Oct 07, 2010 at 09:21 UTC
Great! Sounds like we have common interests then. Have you done any benchmarking experiments comparing hash table implementations? You know: experiments with lots of unique additions; experiments that involve lots of collisions (the hash key already exists); deletion operations; conditional deletion operations; etc. I'd be very interested in hearing your opinions. Can Perl's hash implementation be beaten? Could it be optimised? What about on-disk hash tables using MLDBM? Could those be made to run any faster? How do they compare with using a database equivalent (e.g. MySQL)?
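For reference, the core Benchmark module is enough for a first pass at the operation mix described above (unique additions, updates to existing keys, deletions, lookups). A minimal sketch; the key count and workload mix are arbitrary choices, and each timed piece includes its own setup, so the setup_only entry is there to be mentally subtracted:

#!/usr/bin/perl
# Rough per-operation hash timings using the core Benchmark module.
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @keys = map { "key$_" } 1 .. 100_000;

cmpthese( -3, {    # run each piece for roughly 3 CPU seconds
    setup_only => sub {                      # cost of building the baseline
        my %h = map { $_ => 0 } @keys;       # hash, included in the entries below
    },
    fresh_insert => sub {
        my %h;
        $h{$_} = 1 for @keys;                # every store creates a new key
    },
    update_existing => sub {
        my %h = map { $_ => 0 } @keys;
        $h{$_}++ for @keys;                  # every store hits an existing key
    },
    delete_all => sub {
        my %h = map { $_ => 0 } @keys;
        delete $h{$_} for @keys;             # remove each key in turn
    },
    lookup_hit => sub {
        my %h = map { $_ => 0 } @keys;
        my $hits = grep { exists $h{$_} } @keys;   # successful lookups only
    },
} );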
by BrowserUk (Patriarch) on Oct 07, 2010 at 11:31 UTC
"Have you done any benchmarking experiments comparing hash table implementations? You know: experiments with lots of unique additions; experiments that involve lots of collisions (the hash key already exists); deletion operations; conditional deletion operations; etc."

In general, any decent hash table implementation worthy of the name is effectively (amortised) O(1) for all those operations. But of course, the devil is in the detail. As a general-purpose, anything-you-like-as-a-key, anything-you-like-as-a-payload structure, Perl's implementation is (probably) about as good as it gets. Its main downside is that its performance and generality come at the cost of substantial memory usage. A worthwhile trade when speed and generality are the major criteria.

"Can Perl's hash implementation be beaten? Could it be optimised?"

It absolutely depends upon the use you are making of them, and on the criteria you are trying to optimise for. My current effort has three aims: 1) to trade (a little) speed for a significant memory reduction, by shrinking the internal structures used; 2) to further reduce memory consumption by restricting keys to 255 characters, thereby allowing me to get away with a single byte for the key length; 3) to avoid the need to use memory and CPU to sort the keys, by implementing the hash such that it supports in-order traversal of the keys. All whilst retaining the generality of keys containing any byte values and supporting any SV as a payload. I'm currently bogged down in the typical (for me) XS memory and refcount management issues.

"What about on-disk hash tables using MLDBM? Could those be made to run any faster? How do they compare with using a database equivalent (e.g. MySQL)?"

I don't think performance was a high priority for the implementation of MLDBM. I think they were aiming for a) basic CRUD behaviour (no referential constraints or normal-form requirements); b) with nested, hash-like ease of use (avoiding SQL complexities and bottlenecks); c) whilst retaining ACID reliability in a multi-user environment, with single-file simplicity. And I think they've got pretty close to those goals. To target speed would be to compromise their other goals.

The bottom line is that it is nearly always possible to beat an RDBMS on performance when you have specific knowledge of the data and usage requirements. But doing so requires custom effort for each specific set of requirements. RDBMSs (and Perl's hashes) are aimed at performing well across the widest possible range of general requirements.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
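One memory/speed knob that is exposed at the Perl level is presizing: assigning to keys(%h) asks perl to allocate the bucket array up front, so a large load does not repeatedly double the array and re-bucket every entry. A small sketch; Devel::Size is a CPAN module (not core), and the one-million-key figure is arbitrary:

#!/usr/bin/perl
# Presize a hash and compare its footprint with one grown on demand.
use strict;
use warnings;
use Devel::Size qw(total_size);   # CPAN module, not core

my $n = 1_000_000;

my %presized;
keys(%presized) = $n;             # lvalue keys(): allocate at least $n
                                  # buckets now (rounded up to a power of two)
$presized{"key$_"} = $_ for 1 .. $n;

my %grown;                        # same data, bucket array grown on demand
$grown{"key$_"} = $_ for 1 .. $n;

printf "presized: %d bytes\n", total_size( \%presized );
printf "grown:    %d bytes\n", total_size( \%grown );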
by Anonymous Monk on Oct 07, 2010 at 21:26 UTC
Re: Perl's hash table implementation
by Corion (Patriarch) on Oct 07, 2010 at 08:22 UTC
Hashes live in hv.h and hv.c. I'm not sure what you want to do, or how much knowledge of the rest of the Perl core is necessary to get there. Mostly you'll be looking at macros using other macros in hv.h, but if your only aim is the implementation of Perl's (in-memory) hashes, hv.c should be enough, and it is mostly "real" C code to boot.
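For orientation when reading hv.h: in perls of that era the PERL_HASH macro expanded to Bob Jenkins' "one-at-a-time" function (newer perls use different, seeded hash functions). A rough pure-Perl transcription for illustration only; the real macro also mixes in PERL_HASH_SEED, which is taken as zero here:

#!/usr/bin/perl
# Rough pure-Perl transcription of the "one-at-a-time" hash that PERL_HASH
# in hv.h expanded to around the 5.8/5.10 era. Illustrative only.
use strict;
use warnings;

sub one_at_a_time {
    my ($key) = @_;
    my $hash = 0;                               # PERL_HASH_SEED assumed 0 here
    for my $byte ( unpack 'C*', $key ) {
        $hash = ( $hash + $byte )             & 0xFFFF_FFFF;
        $hash = ( $hash + ( $hash << 10 ) )   & 0xFFFF_FFFF;
        $hash = ( $hash ^ ( $hash >> 6 ) )    & 0xFFFF_FFFF;
    }
    $hash = ( $hash + ( $hash << 3 ) )  & 0xFFFF_FFFF;
    $hash = ( $hash ^ ( $hash >> 11 ) ) & 0xFFFF_FFFF;
    $hash = ( $hash + ( $hash << 15 ) ) & 0xFFFF_FFFF;
    return $hash;
}

# The bucket index is then just the low bits of the hash value.
my $key     = 'foo';
my $buckets = 8;                                # always a power of two in hv.c
my $index   = one_at_a_time($key) & ( $buckets - 1 );
printf "%s -> hash 0x%08x, bucket %d of %d\n",
    $key, one_at_a_time($key), $index, $buckets;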
by jc (Acolyte) on Oct 07, 2010 at 09:11 UTC
Hi. Basically, the kind of programming I do involves a lot of hash table work. This makes Perl an attractive technology for me because of its simple hash table interface. When doing an MSc in High Performance Computing I was not allowed to use Perl, because my dissertation would not have been taken seriously; to quote my supervisors, 'only C and Fortran are taken seriously as high performance programming languages'. So for my dissertation I had to use a less simple C hash table interface such as http://www.cl.cam.ac.uk/~cwc22/hashtable/. And so, in a nutshell, I was just interested in comparing these hash table implementations. Anyway, thanks for pointing me in the right direction.
by JavaFan (Canon) on Oct 07, 2010 at 09:33 UTC
"This makes Perl an attractive technology for me because of its simple hash table interface."

Yeah, but the implementation is focussed on Perl. It's certainly not a wrapper around an external C library. It won't be simple to lift the hash implementation out of perl, adapt it for another language, and still have something that blends well with that language.

"only C and Fortran are taken seriously as high performance programming languages"

There may be a few others (or not...), but Perl certainly isn't known for being suitable as a "high performance" language when it comes to number crunching, or dealing with massive amounts of in-memory data.
by Anonymous Monk on Oct 07, 2010 at 12:20 UTC
Re: Perl's hash table implementation
by halfcountplus (Hermit) on Oct 07, 2010 at 14:48 UTC
This is more about C and hash tables than Perl. (I have not looked at the source, but the name "hash" would strongly imply that they are implemented using a hash table in C.) Maps from the C++ STL -- which are like Perl hashes, and are very high performance -- are implemented using red-black trees. Red-black trees are essentially a binary encoding of low-order Bayer (B-) trees and give O(log n) lookups. Fast databases commonly use Bayer trees for hash-like operations. I've written a hash table in C for large data sets that performs comparably to the STL's "map", using Bayer trees with 4+ keys per node; the idea there is that you have a hash table of trees, and this works very well -- I am certainly not the only person to do it. So unless anyone knows more specifically exactly what kind of data structure is used in the perl source, keep in mind those three possibilities: a plain hash table, a balanced tree, or a hybrid hash table of trees.
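To put the O(1)-versus-O(log n) point in Perl terms, here is a crude comparison of a hash lookup against a binary search over a sorted array standing in for a tree-style structure; the sizes and key format are arbitrary:

#!/usr/bin/perl
# Crude illustration of O(1) hash lookup versus O(log n) search in a sorted
# array (standing in for a tree-like structure).
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $n      = 100_000;
my @sorted = map { sprintf 'key%08d', $_ } 1 .. $n;   # already in order
my %hash   = map { $_ => 1 } @sorted;
my @probe  = map { $sorted[ int rand $n ] } 1 .. 1_000;

sub bsearch {
    my ( $aref, $want ) = @_;
    my ( $lo, $hi ) = ( 0, $#$aref );
    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        my $cmp = $aref->[$mid] cmp $want;
        return $mid if $cmp == 0;
        if   ( $cmp < 0 ) { $lo = $mid + 1 }
        else              { $hi = $mid - 1 }
    }
    return -1;    # not found
}

cmpthese( -2, {
    hash_lookup   => sub { my $hits = grep { exists $hash{$_} } @probe },
    binary_search => sub { my $hits = grep { bsearch( \@sorted, $_ ) >= 0 } @probe },
} );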
Re: Perl's hash table implementation
by sundialsvc4 (Abbot) on Oct 07, 2010 at 17:44 UTC
The “performance” of a hash-table will always depend very heavily upon the exact distribution of key-values presented. If the hash function is favorable to the actual key-data, distributing those values equitably among the buckets, performance of almost any table will be “good;” otherwise not.

A lot depends also upon what you are measuring. Are you, say, looking for the time it takes to locate a particular key that (is|is not) there? The time required to do an insert or delete? The time required (and the algorithm) to resolve collisions?

Memory can also be a factor. Hash tables often have a big footprint. Page faults can be a problem, rendering the hash-table worst case pretty close to “a disk file.” But... algorithms meant to assure good performance in the face of frequent paging-activity might be deemed to be “contributing to unnecessary overhead” if the tests were performed on a memory-plentiful system.

The way that you frame your question has a great deal to do with what is the “right” answer.
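The "favourable distribution" point can be inspected directly from Perl. On perls before 5.26, a non-empty hash evaluated in scalar context returns a "used/allocated" bucket string; current perls expose the same figure through Hash::Util::bucket_ratio(). A small sketch; the key patterns are arbitrary, and with today's seeded hash functions both sets will usually spread well, since how a patterned set fares depends entirely on the hash function in use:

#!/usr/bin/perl
# Inspect how evenly two key sets spread across the bucket array.
# Requires a perl new enough to ship Hash::Util::bucket_ratio; on older
# perls, `scalar %h` on a non-empty hash returns the same string directly.
use strict;
use warnings;
use Hash::Util qw(bucket_ratio);

my %random    = map { "key" . int( rand 1e9 ) => 1 } 1 .. 10_000;
my %patterned = map { "key" . ( $_ * 1024 )   => 1 } 1 .. 10_000;

print "random keys:    ", bucket_ratio(%random),    " (used/allocated buckets)\n";
print "patterned keys: ", bucket_ratio(%patterned), " (used/allocated buckets)\n";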
by jc (Acolyte) on Oct 07, 2010 at 20:04 UTC
Hi. I'll make my questions a little bit clearer with some examples. The two main applications I have developed and experimented with are an n-gram counter and a high-performance implementation of the EM algorithm for automatically learning a translation probability table from bitexts.

Let's consider counting n-gram occurrences in large document collections. If you are counting unigrams, the vast majority of queries are collisions (the word has already been encountered and the count needs to be incremented). Performance problems are usually due to having to increase the number of buckets in the hash table because of an underestimate of how many buckets you would need. The longer the n-gram, the fewer collisions, of course. The problem with collisions is that you have to wait for the result before you can make the decision to update the frequency.

Now, the EM implementation was built on HECToR, the largest supercomputer for academic purposes in Europe. HECToR has swap space disabled to keep applications memory bound. My implementation made careful memory usage counts to stop the hash table from growing too big. Again, due to the high co-occurrence of high-frequency words in the bitexts in the training data, the vast majority of queries were collisions, and the application was spending too much time waiting for the response to make the decision to update the most frequently occurring values in the table. Another interesting side effect of this was that any attempt to speed up my highly optimised version of this algorithm by using more processes in the shared memory space actually slowed it down. This was because serial optimisations based on cache usage go down the swanee when you start to use more cores. But this is by the by.

In conclusion, I guess I'm more interested in seeing how different implementations compare in the case of a high collision rate, though I'm also interested in general; I suppose a variety of benchmarks should be used. The only thing I'm not really interested in is paging issues. I have no interest in letting my hash tables overflow into swap space. In fact, exactly the opposite.
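For readers following along, the Perl-level shape of the unigram/bigram counting workload described above is just a hash increment; the "collision" case (key already present) and a fresh insertion go through the same statement. A minimal sketch, with deliberately naive whitespace tokenisation and an arbitrary presizing figure:

#!/usr/bin/perl
# Minimal n-gram counter over files named on the command line (or STDIN).
use strict;
use warnings;

my $n = 2;                          # n-gram order: 2 = bigrams
my %count;
keys(%count) = 1_000_000;           # optional: presize if the vocabulary
                                    # size can be estimated in advance

while ( my $line = <> ) {
    my @tok = split ' ', $line;     # naive whitespace tokenisation
    for my $i ( 0 .. $#tok - ( $n - 1 ) ) {
        my $gram = join ' ', @tok[ $i .. $i + $n - 1 ];
        $count{$gram}++;            # same code path whether the key is new
                                    # or already present ("collision" above)
    }
}

# Report the 20 most frequent n-grams.
my @top = ( sort { $count{$b} <=> $count{$a} } keys %count )[ 0 .. 19 ];
printf "%8d  %s\n", $count{$_}, $_ for grep { defined } @top;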