Hi,

I'll make my questions a little bit clearer with some examples. The two main applications I have developed and experimented with were an n-gram counter and a high-performance implementation of the EM algorithm for automatically learning a translation probability table from bitexts.

Let's consider counting n-gram occurrences in large document collections. If you are counting unigrams, the vast majority of queries are collisions (the word has already been encountered and its count needs to be incremented). Performance problems are usually due to having to increase the number of buckets in the hash table after underestimating how many you would need. The longer the n-gram, the fewer the collisions, of course. The problem with collisions is that you have to wait for the lookup result before you can make the decision to update the frequency.

Now, the EM implementation was built on HECToR, the largest supercomputer for academic purposes in Europe. HECToR has swap space disabled to keep applications resident in physical memory. My implementation kept careful counts of memory usage to stop the hash table from growing too big. Again, due to the high co-occurrence of high-frequency words in the bitexts of the training data, the vast majority of queries were collisions, and again the application was spending too much time waiting for the response before it could update the most frequently occurring values in the table. Another interesting side effect was that any attempt to speed up my highly optimised version of this algorithm by using more processes in the shared memory space actually slowed it down, because serial optimisations based on cache usage go down the swanee when you start to use more cores. But this is by the by.

In conclusion, I guess I'm most interested in seeing how different implementations compare in the case of a high collision rate, but I'm also interested in the general case. I suppose a variety of benchmarks should be used.

The only thing I'm not really interested in is paging issues. I have no interest in letting my hash tables overflow into swap space. In fact, exactly the opposite.

In reply to Re^2: Perl's hash table implementation by jc
in thread Perl's hash table implementation by jc
