in reply to memory consumption skyhigh...
Your data set could include up to 21,952,000,000,000 points (28,000 cubed) if fully populated. Even though you are working with a sparsely populated version of that set, the number is still going to be huge. If we assume that each word only forms digraphs with 1/5 of the available words, that still leaves you with 175,616,000,000 data points (5,600 cubed) to count. You are suffering from a combinatorial explosion.
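For anyone who wants to check the arithmetic, here is a quick sketch. It assumes the two figures above come from cubing the word count (28,000) and cubing one fifth of it (5,600), which is what the numbers work out to. The helper name commify is just for illustration.

```perl
use strict;
use warnings;

my $words = 28_000;    # vocabulary size used in the post

# Insert thousands separators so the big numbers are readable.
sub commify {
    my ($n) = @_;
    1 while $n =~ s/^(\d+)(\d{3})/$1,$2/;
    return $n;
}

my $full   = $words ** 3;          # every combination populated
my $sparse = ($words / 5) ** 3;    # assumption: only 1/5 of the words per slot

print "fully populated: ", commify($full),   "\n";
print "one fifth:       ", commify($sparse), "\n";
```

Even at a single byte per counter, the smaller figure is far more than will fit in RAM, which is why an in-memory hash blows up.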
Zen recommended using a database. I think that is good advice. MLDBM looks like a nice fit.
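A minimal sketch of what that could look like, assuming MLDBM is layered over DB_File with Storable as the serializer. The file name digraphs.db, the bump helper, and the sample words are made up for illustration, not taken from the original code.

```perl
use strict;
use warnings;
use Fcntl;                         # O_CREAT, O_RDWR
use MLDBM qw(DB_File Storable);    # storage backend, serializer

# 'digraphs.db' is a hypothetical file name.
tie my %count, 'MLDBM', 'digraphs.db', O_CREAT | O_RDWR, 0640
    or die "Cannot open digraphs.db: $!";

# MLDBM cannot update a nested value in place, so fetch the inner
# hash, change it, and store the whole thing back.
sub bump {
    my ($first, $second) = @_;
    my $followers = $count{$first} || {};
    $followers->{$second}++;
    $count{$first} = $followers;
}

bump('the', 'cat');
bump('the', 'cat');
bump('the', 'dog');

print "the cat: $count{the}{cat}\n";

untie %count;
```

The counts live on disk rather than in RAM, so memory stays flat, at the price of a serialize/deserialize round trip on every update. The counts also persist between runs, so delete the file if you want to start from zero.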
Update: Calculations based on 28000 words, not 27000. Credited Zen by name for his advice.
TGI says moo