in reply to term frequency and mutual info

For each language file, create a hash with the words as keys and a comma-separated list of line numbers as the data.
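A minimal sketch of building such a hash, assuming plain ASCII words separated by non-word characters and case-insensitive matching. The sub name `build_index` and the sample text are made up for illustration; a real language file would be opened by name instead of through an in-memory filehandle.

```perl
use strict;
use warnings;

# Build a hash mapping each word to a comma-separated list of the
# line numbers on which it occurs.
sub build_index {
    my ($fh) = @_;
    my %index;
    while ( my $line = <$fh> ) {
        my $line_no = $.;    # line number of the handle just read
        my %seen;            # list each line only once per word
        for my $word ( grep { length } split /\W+/, lc $line ) {
            next if $seen{$word}++;
            $index{$word} = exists $index{$word}
                ? "$index{$word},$line_no"
                : $line_no;
        }
    }
    return \%index;
}

# Hypothetical sample data, read through an in-memory filehandle.
my $text = "un chat noir\nle chien\nun chien et un chat\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my $index = build_index($fh);
close $fh;

print "$_ => $index->{$_}\n" for sort keys %$index;
```

Text with accented characters would need the file decoded (for example with an `:encoding(UTF-8)` layer) before `\W+` splits it correctly.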

Since that hash will be quite large, use a database to store it. A very popular solution for a disk-based hash is DBM::Deep: easy to use, fast, and well tested.

If the hash fits into memory, you could accumulate it in memory first and then store it to disk. If not, the initial creation of the hash will take somewhat longer, but not much, thanks to disk caches. It is a price you have to pay only once anyway.
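Roughly, the "accumulate in memory, then store" variant could look like the sketch below. It assumes DBM::Deep is installed from CPAN; the file name `index_fr.db` and the sample hash are invented for the example.

```perl
use strict;
use warnings;
use DBM::Deep;

# Hypothetical in-memory index: word => comma-separated line numbers.
my %index = ( un => '1,3', chat => '1,3', chien => '2,3' );

# Open (or create) the disk-based hash; the file name is an assumption.
my $db = DBM::Deep->new('index_fr.db');

# Copy the in-memory hash to disk, one key at a time.
$db->{$_} = $index{$_} for keys %index;

# A later run only needs to reopen the file to read the data back.
my $again = DBM::Deep->new('index_fr.db');
print "un => $again->{un}\n";
```

If the hash does not fit into memory, the same `$db->{$word}` assignments can be made directly while reading the language file, at the cost of a slower first run.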

After that, finding out the lines where 'un' occurred is just a simple hash access and a split, practically instantaneous.
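The lookup itself might be sketched like this. The helper name `lines_for` and the sample hash are hypothetical; the same code works unchanged if `%index` is replaced by a DBM::Deep object.

```perl
use strict;
use warnings;

# Hypothetical index: word => comma-separated line numbers.
my %index = ( un => '1,3', chat => '1,3', chien => '2,3' );

# Return the line numbers for a word, or an empty list if it is unknown.
sub lines_for {
    my ( $index, $word ) = @_;
    return () unless exists $index->{$word};
    return split /,/, $index->{$word};
}

my @lines = lines_for( \%index, 'un' );
print "'un' occurs on lines: @lines\n";
```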

Re^2: term frequency and mutual info
by perl_lover_always (Acolyte) on Oct 22, 2010 at 08:26 UTC
    Thanks, I'll look into the database, since I guess it would be useful to create it once and use it anytime without cache waste.