Re^2: Perl's hash table implementation

Hi,

I'll make my questions a little bit clearer with some examples. The two main applications I have developed and experimented with were an n-gram counter and a high performance of the EM algorithm for automatically learning a translation probability table from bitexts.

Let's consider counting n-gram occurrence in large document collections. If you are counting unigrams the vast majority of queries are collisions (the word has already been encountered and the count needs to be incremented). Performance problems are usually due to having to increase the number of buckets in the hash table due to an underestimation of how many buckets you would need. The longer the n-gram the less collisions, of course. The problem with collisions is that you have to wait for the result before you can make the decision to update the frequency.

Now the EM implementation was built on HECToR, the largest super computer for academic purposes in Europe. HECToR has swap space disabled to keep applications memory bound. My implementation made careful memory usage counts to stop the hash table from growing too big. Again, due to the high cooccurrence of high frequency words in the bitexts in the training data the vast majority of queries were collisions. Again, the application was spending too much time waiting for the response to make the decision to update the most frequently occurring values in the table. Another interesting side effect of this was that any attempt to speed up my highly optimised version of this algorithm by using more processes in the shared memory space actually slowed it down. This was because serial optimisations based on cache usage go down the swanee when you start to use more cores. But this is by the by.

In conclusion, I guess I'm more interested in seeing how different implementations compare in the case of a high collision rate. But, I'm also interested in general. I suppose a variety of benchmarks should be used.

The only thing I'm not really interested in in is pageing issues. I have no interest in letting my hash tables overflow into swap space. In fact, exactly the opposite.

Comment on Re^2: Perl's hash table implementation