Perhaps when reading the word frequency file, keep only those words that contain accented characters. That could save a bit of memory.
You then have to build a hash with the unaccented variants of those words as keys and the original words as values. Then read the second file, look up each of its words in the hash, and replace the word with the stored value if one exists. Take special care to preserve upper and lower case.
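Here is a rough sketch of that approach, written in Python with a dict standing in for the hash. Treat it as an illustration, not a finished solution. The "word count" line format of the frequency file is an assumption, and the helper names (`strip_accents`, `build_lookup`, `match_case`, `fill_diacritics`) are made up for this example. When several accented words collapse to the same unaccented key, the sketch keeps the most frequent one, which is also just one possible choice.

```python
import re
import unicodedata


def strip_accents(word):
    """Remove combining marks, e.g. 'č' -> 'c'.

    Letters with no Unicode decomposition (such as 'ł') pass through unchanged.
    """
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


def build_lookup(freq_lines):
    """Map unaccented lowercase form -> most frequent accented form.

    Assumes each line looks like 'word count' (hypothetical format).
    """
    best = {}  # unaccented key -> (count, accented word)
    for line in freq_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not fit the assumed format
        word, count = parts[0].lower(), int(parts[1])
        key = strip_accents(word)
        if key == word:
            continue  # no accented characters: not stored, saves memory
        if key not in best or count > best[key][0]:
            best[key] = (count, word)
    return {key: word for key, (_count, word) in best.items()}


def match_case(replacement, original):
    """Give the lowercase replacement the casing pattern of the original."""
    if original.isupper():
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement


# Runs of letters only (no digits, no underscore), Unicode-aware.
WORD_RE = re.compile(r"[^\W\d_]+")


def fill_diacritics(text, lookup):
    """Replace each word found in the lookup, preserving its case."""
    def repl(match):
        word = match.group(0)
        found = lookup.get(word.lower())
        return match_case(found, word) if found else word
    return WORD_RE.sub(repl, text)
```

For example, with the frequency lines `čaj 120`, `caj 3`, `žena 50` and `pes 80`, the lookup holds only `caj` and `zena`, and `fill_diacritics("Caj a ZENA, pes.", lookup)` gives `Čaj a ŽENA, pes.`. Note that this replaces a word even when its unaccented spelling is itself a valid word; comparing the two frequencies before replacing would be a natural refinement.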
In reply to Re: fill diacritic into text
by ambrus
in thread fill diacritic into text
by jajaja