Re: fill diacritic into text

I'm a little confused about your task (maybe you are too?)...

Your sample of "input" data contains no diactritics -- just plain (unaccented) ascii letters -- and your sample of "cetnosti" data has some words with diacritics (e.g. "že"). You seem to be saying: for some of the (unaccented) words that occur in "input", you want to replace them with (accented) words from "cetnosti". Is that right?

I think the first thing you have to cover is how to relate accented letters to their unaccented (ascii) counterparts (e.g. where "cetnosti" has "že", "input" will have "ze"). Then you have to maintain a hash of words containing accented characters, keyed by their unaccented version -- that is:

  my %respell = ( 'ze' => "\x{017e}e", ... );
[download]

(updated that to use correct quote marks)

I don't think you need Tree::Trie for this. It looks like a great module, which applies a prefix-lookup scheme that I've used myself on occasion, so I'm glad to know there's a name and a module for that approach. I'm very grateful to you for bringing it to my attention, but I don't think you need it for this application.

You also don't need to read (and hold in memory) all the contents of your "cetnosti" file. You just need to keep the words that contain accented letters, and store those in a hash keyed by the "unaccented" variant of the word. Something like this to load the hash ought to work (I'd read about Unicode::Normalize before, but now that I try it out, it's really cool):

use Unicode::Normalize;

my %respell;
open( INFO, "<:utf8", "cetnosti" ) or die "cetnosti: $!";
while (<INFO>) {
    next unless ( /[^[:ascii:]]/ ); # skip words that are all-ascii

    my ( $word, $freq ) = split;

    my $ascii_word = NFD( $word );  # break accented letters into lett
+er, diacritic
    $ascii_word =~ s/[^[:ascii:]]+//g; # delete diacritics

    $respell{$ascii_word} = $word;
}
close INFO;
[download]

Assuming that works as intended, now you just need to go through your input file, tokenize it as needed, and check each token to see if it exists as a hash key in %respell. If so, replace the token with the value of that hash element:

open( INPUT, "<:utf8", "input" ) or die "input: $!";
open( OUTPUT, ">:utf8", "respelled" ) or die "respelled: $!";

while (<INPUT>) {
    my $outstr = '';
    for my $tkn ( split /(\s+)/ ) {
        if ( exists( $respell{$tkn} )) {
            $tkn = $respell{$tkn};
        }
        $outstr .= $tkn;
    }
    print OUTPUT $outstr;
}
[download]

My code snippets have not done anything to handle upper vs. lower case in the input (or cetnosti), but you should be able to work that out; also, if the "input" file has punctuation (e.g. "word, word. word?" etc), you'll need to factor that into the split regex; /([\s\p{P}]+)/ would probably work for that.

(Notice that I'm putting parens in the split regex -- that captures whatever character sequence makes up a token boundary, so that the whole string can easily be put back together with all the original token boundaries intact.)

UPDATE: WARNING: This sort of token replacement will do serious damage when the language in question has sets of words that are distinguished only by accent marks -- e.g. I would not use this approach for Spanish, because there are many pairs of common words like "que" and "qué", where the accent difference is significant; the code shown above would obliterate it.

Comment on Re: fill diacritic into text Select or Download Code

Replies are listed 'Best First'.
Re^2: fill diacritic into text by jajaja (Initiate) on May 31, 2007 at 06:27 UTC
thank you a lot for this idea.. it didnt work properly with Normalize module so i tried it this way... sorry for taking out strict :) $times = time; $filei = "cetnosti"; $filer = "input"; $filew = "output"; $filec = "correct"; open( INFO, $filei ) or die "cetnosti: $!"; $lineno = 1; while ((defined ($_ = <INFO>)) && ($lineno < 500000)) { ( $word, $freq ) = split; $ascii_word = $word; $ascii_word =~ tr/ľščťžЩАМИДЗňТ& +#283;řŕĺůСďĽŠČŤŽщам +идзŇтĚŘŔĹŮсĎ/lsctzyaieaunoerrluodL +SCTZYAIEAUNOERRLUOD/; $lineno++; if ( exists( $respell{$ascii_word} )) { next; } $respell{$ascii_word} = $word; } close INFO; open( INPUT, $filer ) or die "input: $!"; open( OUTPUT, "> $filew" ) or die "respelled: $!"; while (<INPUT>) { $outstr = ''; for $tkn ( split /([\s\p{P}]+)/ ) { if ( exists( $respell{$tkn} )) { $tkn = $respell{$tkn}; } $outstr .= $tkn; } print OUTPUT $outstr; } close INPUT; close OUTPUT; $timee = time; $timer = $timee - $times; print "execution time: $timer seconds\n"; [download] im sure there are many beginners mistakes but it works :) now i would need to compare "output" with "correct". "correct" is a file with diacritic and i need to know how many words were replaced good. is there some way to do this in perl? thank you	[reply] [d/l]
Re^3: fill diacritic into text by graff (Chancellor) on May 31, 2007 at 12:56 UTC
sorry for taking out strict :) Maybe you don't know yet how sorry you might be later. ;) now i would need to compare "output" with "correct". "correct" is a file with diacritic and i need to know how many words were replaced good. is there some way to do this in perl? thank you Presumably, the "correct" file and your "test output" file should have the same number of lines and the same number of word tokens. (The unix "wc" command would be good for confirming that -- if you have ms-windows with cygwin installed, "wc" comes with that; for any given input file, it reports the number of lines, words and bytes.) And if you have "wc", then you also have the unix "diff" command. No perl scripting necessary for this task. But if you wanted to write a perl script for it anyway, just open both files for input, use a single loop that will read a line from each file, tokenize the two corresponding lines into two arrays, then use a nested loop to compare the tokens. Nothing complicated about that.	[reply]