in reply to Re^3: memory-efficient hash kind for incremental sort
in thread memory-efficient hash kind for incremental sort
To accumulate 300 million unique citations from 30,000 documents, you'd need to extract (on average) 10,000 from each document.
Now, I've read a few papers where the citation list was longer than the document itself, but even if every citation contributed 10 terms (a title and 9 authors), that would still require every one of those 30,000 documents to reference 1,000 other documents cited by nothing else. (In addition to any citations it has in common with one or more other papers.)
"I want to allocate 2GB of RAM to the task."
If every one of those 300 million terms was referenced by just one of the 30,000 papers, you'd need a minimum of 2 bytes (a 16-bit integer) for each value in your hash(-like) data structure. That's 600 MB right there, without considering the keys or data-structure overhead.
And, in Perl, that integer is going to mean at least an SvIV, which is 16 bytes, so your values alone require at least 4.4 GB. For a normal Perl hash, you'd need roughly another 40 GB for the keys!
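To put rough numbers on that yourself, here is a minimal sketch (not from the original post) that measures the real per-entry cost of a native hash with citation-sized keys using Devel::Size, then extrapolates to 300 million entries:

```perl
use strict;
use warnings;
use Devel::Size qw(total_size);

my %h;
my $n = 100_000;                                   # small sample; extrapolate from it
$h{ sprintf 'citation key number %09d', $_ } = 1 for 1 .. $n;

my $bytes = total_size(\%h);                       # keys + values + hash overhead
printf "%d entries: %d bytes (%.1f bytes/entry)\n", $n, $bytes, $bytes / $n;
printf "Extrapolated to 300 million entries: ~%.1f GB\n",
       $bytes / $n * 300_000_000 / 2**30;
```

Run it with longer or shorter sample keys to see how the total moves; the per-entry overhead stays well above anything that fits 300 million entries into 2 GB.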
I think that you are way overestimating the number of keys, but even if you reduce that estimate by an order of magnitude, there is no way you will fit the information into a native hash in 2GB. (Nor any data structure built on top of one.)
It seems pretty clear that you will need to use some kind of permanent-storage-based DB.
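One way to do that while keeping the familiar hash interface is to tie the hash to an on-disk Berkeley DB. A minimal sketch, assuming DB_File (bundled with most perls) and a trivial placeholder standing in for whatever actually extracts the citations from a document:

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_CREAT O_RDWR);

# The counts live in citations.db on disk; only a small cache stays in RAM.
tie my %count, 'DB_File', 'citations.db', O_CREAT | O_RDWR, 0666, $DB_BTREE
    or die "Cannot tie citations.db: $!";

for my $doc ( glob 'papers/*.txt' ) {
    $count{$_}++ for extract_citations($doc);
}

# $DB_BTREE keeps the keys in sorted order, so this walk needs no sort pass.
while ( my ($citation, $n) = each %count ) {
    print "$n\t$citation\n";
}

untie %count;

# Placeholder only: stands in for whatever really parses a document's
# reference list (assumed here to be one citation per line).
sub extract_citations {
    my ($file) = @_;
    open my $fh, '<', $file or die "$file: $!";
    chomp( my @lines = <$fh> );
    return @lines;
}
```

The B-tree flavour is what helps with the "incremental sort" part of the thread: keys come back in lexical order as you iterate, without ever holding them all in memory. SQLite via DBI, or BerkeleyDB.pm, would work along the same lines.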