in reply to fill diacritic into text

Perhaps when reading the word frequency file, keep only those words that contain accented characters. That could save a bit of memory.

You then have to build a hash with the unaccented variants of those words as keys and the original words as values. Then read the second file, look up each of its words in the hash, and replace the word with the value if one exists. Take special care to preserve upper and lower case.
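
A minimal sketch of that idea, written in Python rather than Perl (a dict plays the role of the hash). The function names, the `(word, count)` input format, and the case rules are assumptions made for illustration, not something given in the thread. It keeps only accented words and does not compare them against the frequency of the unaccented form.

```python
import re
import unicodedata

def strip_accents(word):
    # Decompose characters (NFD) and drop the combining marks.
    # Letters without a decomposition (e.g. "ø") pass through unchanged.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def build_lookup(freq_pairs):
    # freq_pairs: iterable of (word, count) pairs (assumed format).
    # Keep only accented words; per unaccented key, keep the most
    # frequent accented variant.
    best = {}
    for word, count in freq_pairs:
        word = word.lower()
        key = strip_accents(word)
        if key == word:
            continue  # word has no accents, nothing to restore
        if key not in best or count > best[key][1]:
            best[key] = (word, count)
    return {key: word for key, (word, _count) in best.items()}

def match_case(template, word):
    # Copy the casing pattern of `template` onto `word`
    # (all caps, capitalised, or left as is).
    if template.isupper():
        return word.upper()
    if template[:1].isupper():
        return word[:1].upper() + word[1:]
    return word

def fill_diacritics(text, lookup):
    # Replace every word that has an accented variant in the lookup.
    def replace(match):
        token = match.group(0)
        hit = lookup.get(token.lower())
        return match_case(token, hit) if hit else token
    return re.sub(r"\w+", replace, text)

lookup = build_lookup([("café", 10), ("cafe", 2), ("résumé", 5), ("the", 100)])
print(fill_diacritics("Cafe and RESUME, the cafe", lookup))
```

Lowercasing the key before the lookup and re-applying the original casing afterwards is one simple way to handle the upper/lower case concern; mixed-case tokens would need a finer rule.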

Re^2: fill diacritic into text
by jajaja (Initiate) on May 31, 2007 at 11:19 UTC
    Yes, you are right, it would save memory, but the success rate would be lower, because some words are more frequent in their unaccented form. If I read only the words that contain accented characters, I couldn't know which variant is used more often.

      To tell the truth, I'm quite surprised that you have a word frequency file whose words don't fit in memory. But if this is really the case, you can do the following.

      First, transform the frequency file into another file by prefixing each line with the unaccented version of the word, while still keeping the accented version. You can do this easily without reading the whole file into memory. Then sort this file using the unaccented versions as the key. Finally, read the sorted file. Because all the words that share a given unaccented variant now sit next to each other, you can process them one group at a time and keep in memory only those words that are accented and do not have an unaccented variant with a higher frequency.
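
The three steps above can be sketched as follows, in Python. This is only an illustration under assumptions: each frequency line is taken to be `word count` separated by whitespace, the function names are made up, and the in-memory `sorted()` call in the demo merely stands in for an external sort of the decorated file (for example, a command-line sort utility run on it).

```python
import itertools
import unicodedata

def strip_accents(word):
    # Decompose characters (NFD) and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def decorate(lines):
    # Step 1: prefix every "word count" line with its unaccented key.
    # Works line by line, so the whole file is never held in memory.
    for line in lines:
        word, count = line.split()
        yield f"{strip_accents(word)}\t{word}\t{count}"

def select_winners(sorted_lines):
    # Step 3: in the sorted file, lines with the same key are adjacent,
    # so only one group is examined at a time.
    rows = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for key, group in itertools.groupby(rows, key=lambda row: row[0]):
        best_word, _best_count = max(
            ((word, int(count)) for _key, word, count in group),
            key=lambda pair: pair[1],
        )
        # Keep the entry only if the most frequent variant is accented.
        if best_word != key:
            yield key, best_word

freq_lines = ["cafe 2", "café 10", "role 50", "rôle 3", "the 100"]
decorated = sorted(decorate(freq_lines))  # Step 2: stand-in for an external sort
print(dict(select_winners(decorated)))
```

Sorting whole lines is enough here because the key is the line prefix, so lines with equal keys end up contiguous. The resulting mapping is small (accented winners only) and can then be used as the lookup hash when reading the text file.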