in reply to Perl's hash table implementation

The “performance” of a hash table will always depend very heavily upon the exact distribution of the keys presented. If the hash function is favorable to the actual key data, distributing those keys equitably among the buckets, the performance of almost any table will be “good”; otherwise, not.
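
For instance, on a reasonably recent perl (5.26 or later) you can peek at how evenly a set of keys landed in the buckets via Hash::Util; the key set below is just an arbitrary example, and this is only a rough illustration:

#!/usr/bin/perl
use strict;
use warnings;

# Rough illustration: report how Perl's hash function spread a set of keys
# over the bucket array. Requires Perl 5.26+, where Hash::Util exposes
# bucket statistics; on older perls, %h in scalar context gave a similar
# "used/total" string.
use Hash::Util qw(bucket_ratio num_buckets used_buckets);

my %h = map { $_ => 1 } 'aaa' .. 'zzz';    # 17,576 short string keys

printf "keys:               %d\n", scalar keys %h;
printf "buckets used/total: %s\n", bucket_ratio(%h);
printf "total buckets:      %d\n", num_buckets(%h);
printf "used buckets:       %d\n", used_buckets(%h);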

A lot also depends upon what you are measuring. Are you, say, looking at the time it takes to locate a particular key that (is|is not) there? The time required to do an insert or a delete? The time required (and the algorithm used) to resolve collisions?
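
To make that distinction concrete, here is the sort of thing I mean, using nothing but the core Benchmark module; the key counts are arbitrary and the three cases are only a sketch:

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative sketch: time three different access patterns against the same
# built-in hash -- lookups of keys that are present, lookups of keys that are
# absent, and fresh inserts.
use Benchmark qw(cmpthese);

my @present = map { "word$_" } 1 .. 100_000;
my @absent  = map { "miss$_" } 1 .. 100_000;
my %table   = map { $_ => 1 } @present;

cmpthese(-3, {
    lookup_hit  => sub { my $n = 0; $n += $table{$_} // 0 for @present; },
    lookup_miss => sub { my $n = 0; $n += $table{$_} // 0 for @absent;  },
    insert      => sub { my %t; $t{$_} = 1 for @present;                },
});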

Memory can also be a factor. Hash tables often have a big footprint. Page faults can be a problem, rendering the worst case of a hash table pretty close to “a disk file.” But... algorithms meant to assure good performance in the face of frequent paging activity might be deemed to be “contributing to unnecessary overhead” if the tests were performed on a memory-plentiful system.
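
A quick way to see just how big that footprint gets is the CPAN module Devel::Size (not in core, so this is only a sketch assuming you have it installed):

#!/usr/bin/perl
use strict;
use warnings;

# Sketch: measure the memory actually held by a hash, which is usually far
# larger than the raw bytes of the keys and values themselves.
# Devel::Size is a CPAN module, not part of the core distribution.
use Devel::Size qw(total_size);

my %counts = map { sprintf("token%06d", $_) => 1 } 1 .. 100_000;

printf "total_size of hash: %.1f MB\n", total_size(\%counts) / (1024 * 1024);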

How you frame your question has a great deal to do with what the “right” answer is.

Re^2: Perl's hash table implementation
by jc (Acolyte) on Oct 07, 2010 at 20:04 UTC

    Hi,

    I'll make my questions a little bit clearer with some examples. The two main applications I have developed and experimented with are an n-gram counter and a high-performance implementation of the EM algorithm for automatically learning a translation probability table from bitexts.

    Let's consider counting n-gram occurrences in large document collections. If you are counting unigrams, the vast majority of queries are collisions (the word has already been encountered and its count needs to be incremented). Performance problems are usually due to having to grow the number of buckets in the hash table because the initial estimate of how many you would need was too low. The longer the n-gram, the fewer collisions, of course. The problem with collisions is that you have to wait for the lookup result before you can decide to update the frequency.
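
    To show what I mean, here is a stripped-down sketch of the unigram loop; the vocabulary estimate and the whitespace tokenisation are assumptions, but the lvalue assignment to keys() is the Perl idiom for pre-allocating buckets so the table never has to grow mid-count:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch of a unigram counter. keys(%hash) = N pre-sizes the bucket
    # array (rounded up to a power of two), so a decent guess at the
    # vocabulary size avoids repeated bucket doubling during counting.
    # The 2_000_000 estimate and split-on-whitespace are illustrative.
    my %count;
    keys(%count) = 2_000_000;

    while (my $line = <STDIN>) {
        $count{$_}++ for split ' ', $line;
    }

    printf "%d distinct unigrams\n", scalar keys %count;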

    Now, the EM implementation was built on HECToR, the largest supercomputer for academic purposes in Europe. HECToR has swap space disabled to force applications to stay within physical memory. My implementation kept careful memory usage counts to stop the hash table from growing too big. Again, because high-frequency words co-occur so heavily in the bitexts in the training data, the vast majority of queries were collisions, and again the application spent too much time waiting for the response before it could update the most frequently occurring values in the table. Another interesting side effect was that any attempt to speed up my highly optimised version of this algorithm by using more processes in the shared memory space actually slowed it down, because serial optimisations based on cache usage go down the swanee when you start to use more cores. But this is by the by.
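
    For what it's worth, here is a rough sketch of the general idea of capping a count hash when swapping isn't an option. This is not the HECToR code; the key budget, the headroom, and the pruning policy are made-up numbers for illustration only:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Illustration only: keep a count hash within a fixed key budget by
    # pruning the rarest entries whenever the budget is exceeded.
    # MAX_KEYS and the 10% headroom are arbitrary.
    use constant MAX_KEYS => 5_000_000;

    my %count;

    sub bump {
        my ($key) = @_;
        $count{$key}++;
        prune() if keys(%count) > MAX_KEYS;
    }

    sub prune {
        my @by_freq = sort { $count{$a} <=> $count{$b} } keys %count;
        my $target  = int(MAX_KEYS * 0.9);
        delete $count{ shift @by_freq } while keys(%count) > $target;
    }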

    In conclusion, I guess I'm mainly interested in seeing how different implementations compare in the case of a high collision rate, but I'm also interested in the general case. I suppose a variety of benchmarks should be used.
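
    Something like the following is the kind of benchmark I have in mind: a skewed, roughly Zipf-like token stream so that most updates hit a key that is already in the table. The vocabulary size, the skew, and the two increment idioms being compared are all just assumptions for the sake of the sketch:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch of a high-hit-rate workload: draw tokens with a roughly 1/rank
    # (log-uniform) skew so that a handful of "words" dominate, then compare
    # two increment idioms over the same built-in hash.
    use Benchmark qw(cmpthese);

    my @vocab = map { "w$_" } 1 .. 50_000;

    my @tokens;
    for (1 .. 500_000) {
        my $rank = int(exp(rand() * log(scalar @vocab)));
        push @tokens, $vocab[$rank % @vocab];
    }

    cmpthese(-3, {
        plain_increment => sub { my %c; $c{$_}++ for @tokens; },
        exists_first    => sub {
            my %c;
            exists $c{$_} ? $c{$_}++ : ($c{$_} = 1) for @tokens;
        },
    });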

    The only thing I'm really not interested in is paging issues. I have no interest in letting my hash tables overflow into swap space; in fact, exactly the opposite.