Re^2: fill diacritic into text

Yes, you are right, it would save memory, but the success rate would be lower. Some words are more frequent in their unaccented form, and if I read only the words that contain accented characters, I couldn't know which variant is usually the more common one.

Replies are listed 'Best First'.
Re^3: fill diacritic into text
by ambrus (Abbot) on Jun 01, 2007 at 09:35 UTC

To tell the truth, I'm quite surprised that you have a word frequency file whose words don't fit in memory. But if this is really the case, you can do the following.

First, transform the frequency file into another file by prefixing each line with the unaccented version of the word, while still keeping the accented version. You can do this easily without reading the whole file into memory. Then sort this file using the unaccented versions as the key. Finally, read the sorted file. Because all the variants of a given unaccented word now end up next to each other, you can process them one group at a time and keep in memory only those lines that are accented and do not have an unaccented variant with a higher frequency.
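
The steps above could be sketched roughly as follows, in Python for brevity. This is only an illustration under assumptions the thread does not state: the frequency file is assumed to hold one `word frequency` pair per line, and the helper names (`strip_accents`, `keyed_lines`, `best_accented`) are made up for the example. As a simplification, it keeps only the single most frequent variant per unaccented key, and only when that variant is accented. For a file that really does not fit in memory, the in-memory `sorted()` call would be replaced by an external sort on the first column.

```python
import unicodedata
from itertools import groupby


def strip_accents(word):
    # Decompose characters (NFD) and drop the combining marks.
    # Letters without a decomposition (e.g. "ł") pass through unchanged.
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def keyed_lines(lines):
    # Step 1: prefix each record with its unaccented key, one line at a time.
    for line in lines:
        word, freq = line.split()
        yield strip_accents(word), word, int(freq)


def best_accented(sorted_records):
    # Step 3: records must already be sorted by key, so that all variants
    # of one unaccented word are adjacent and only one group is held at once.
    for key, group in groupby(sorted_records, key=lambda r: r[0]):
        _, word, _ = max(group, key=lambda r: r[2])
        if word != key:  # keep only if the most frequent variant is accented
            yield key, word


# Hypothetical sample data; a real run would stream lines from the file.
sample = ["máme 50", "mame 3", "je 900", "jé 2", "čaj 40"]

# Step 2: sort by the unaccented key (stand-in for an external sort).
records = sorted(keyed_lines(sample), key=lambda r: r[0])

table = dict(best_accented(records))
print(table)
```

With the sample data, `je` is dropped because its unaccented form is the more frequent one, so the resulting table stays smaller than the full word list.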
