http://qs1969.pair.com?node_id=517243


in reply to Creating Dictionaries

IMHO, the problem is not that the input is sorted, but that every entry is unique, so the hash keeps growing until it eats all the memory. In ordinary text files, most words are repetitions of words already seen, so they don't make the hash grow.

There are several ways to solve that problem. For instance, you can try using an on-disk tree with DB_File (its $DB_BTREE format).
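A minimal sketch of the DB_File idea, assuming words are matched with \w+ and lowercased (both are my assumptions, not something from the original question). The sub name build_dictionary and the file name are made up for illustration. The hash is tied to a B-tree file, so the keys live on disk instead of in memory:

```perl
use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDWR O_CREAT);

# Hypothetical helper: read text from $in_fh, store every distinct word
# in an on-disk B-tree at $db_path, then print the words to $out_fh.
sub build_dictionary {
    my ($in_fh, $out_fh, $db_path) = @_;

    # $DB_BTREE keeps the keys ordered on disk.
    tie my %seen, 'DB_File', $db_path, O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "Cannot tie $db_path: $!";

    while (my $line = <$in_fh>) {
        # Assumption: a "word" is a run of \w characters, case-folded.
        $seen{lc $1} = 1 while $line =~ /(\w+)/g;
    }

    # With a B-tree, each() walks the keys in sorted (byte-wise) order,
    # so no separate in-memory sort is needed.
    while (my ($word) = each %seen) {
        print {$out_fh} "$word\n";
    }

    untie %seen;
    return;
}
```

Called as build_dictionary(\*STDIN, \*STDOUT, 'words.db'), it should use roughly constant memory, at the cost of disk I/O on every insert, so expect it to be a lot slower than a plain hash.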

Another way is to flush the words found so far to temporary files on disk every time their number goes over some limit and, at the end, perform a merge sort and eliminate duplicates.
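Something along these lines could do the flush-and-merge approach. It is only a sketch: the sub names, the \w+ word pattern, and the lowercasing are my assumptions, and the limit is whatever fits in your memory. Each batch is written out sorted, so the final step is a plain k-way merge that skips repeated words:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the sorted words of one batch to a temp file, empty the batch,
# and return the temp file name.
sub flush_batch {
    my ($batch) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);   # removed at program exit
    print {$fh} "$_\n" for sort keys %$batch;
    close $fh or die "close $name: $!";
    %$batch = ();
    return $name;
}

# Read words from $in_fh, flushing a sorted run to disk every time the
# number of distinct words in memory reaches $limit.
sub split_into_runs {
    my ($in_fh, $limit) = @_;
    my (%batch, @runs);
    while (my $line = <$in_fh>) {
        while ($line =~ /(\w+)/g) {    # assumption: words are \w+ runs
            $batch{lc $1} = 1;
            push @runs, flush_batch(\%batch) if keys(%batch) >= $limit;
        }
    }
    push @runs, flush_batch(\%batch) if %batch;
    return @runs;
}

# Merge the sorted run files into $out_fh, dropping duplicates.
sub merge_runs {
    my ($out_fh, @runs) = @_;
    my (@fhs, @heads);
    for my $name (@runs) {
        open my $fh, '<', $name or die "open $name: $!";
        my $head = <$fh>;
        chomp $head if defined $head;
        push @fhs,   $fh;
        push @heads, $head;
    }
    my $last;
    while (1) {
        # Pick the run whose current word is smallest (linear scan).
        my $min;
        for my $i (0 .. $#heads) {
            next unless defined $heads[$i];
            $min = $i if !defined $min || $heads[$i] lt $heads[$min];
        }
        last unless defined $min;       # every run is exhausted
        my $word = $heads[$min];
        print {$out_fh} "$word\n" unless defined $last && $word eq $last;
        $last = $word;
        my $next = readline($fhs[$min]);
        chomp $next if defined $next;
        $heads[$min] = $next;
    }
    return;
}
```

Usage would be something like my @runs = split_into_runs(\*STDIN, 100_000); merge_runs(\*STDOUT, @runs);. Note that this opens every run file at once, so with a very large number of runs you may hit the open-file limit and need to merge in several passes.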