Re^2: fill diacritic into text

Thank you a lot for this idea. It didn't work properly with the Normalize module, so I tried it this way... sorry for taking out strict :)

$times = time;
$filei = "cetnosti";
$filer = "input";
$filew = "output";
$filec = "correct";

open( INFO, $filei ) or die "cetnosti: $!";
$lineno = 1;
while ((defined ($_ = <INFO>)) && ($lineno < 500000)) {
        ( $word, $freq ) = split;
        $ascii_word = $word;
        $ascii_word =~ tr/ľščťžýáíéäúňôěřŕĺůóďĽŠČŤŽÝÁÍÉÄÚŇÔĚŘŔĹŮÓĎ/lsctzyaieaunoerrluodLSCTZYAIEAUNOERRLUOD/;
        $lineno++;
        if ( exists( $respell{$ascii_word} )) {
            next;
            }
        $respell{$ascii_word} = $word;
        }
close INFO;

open( INPUT, $filer ) or die "input: $!";
open( OUTPUT, "> $filew" ) or die "respelled: $!";

while (<INPUT>) {
        $outstr = '';

        for $tkn ( split /([\s\p{P}]+)/ ) {
                if ( exists( $respell{$tkn} )) {
                        $tkn = $respell{$tkn};
                        }
                $outstr .= $tkn;
                }
        print OUTPUT $outstr;
        }
close INPUT;
close OUTPUT;
$timee = time;
$timer = $timee - $times;
print "execution time: $timer seconds\n";

I'm sure there are many beginner's mistakes, but it works :) Now I need to compare "output" with "correct". "correct" is a file with diacritics, and I need to know how many words were restored correctly. Is there some way to do this in Perl? Thank you.

Re^3: fill diacritic into text
by graff (Chancellor) on May 31, 2007 at 12:56 UTC

sorry for taking out strict :)

Maybe you don't know yet how sorry you might be later. ;)

Now I need to compare "output" with "correct". "correct" is a file with diacritics, and I need to know how many words were restored correctly. Is there some way to do this in Perl? Thank you.

Presumably, the "correct" file and your "test output" file should have the same number of lines and the same number of word tokens. (The Unix "wc" command would be good for confirming that. If you have MS Windows with Cygwin installed, "wc" comes with that; for any given input file, it reports the number of lines, words and bytes.)

And if you have "wc", then you also have the Unix "diff" command. No Perl scripting is necessary for this task. But if you wanted to write a Perl script for it anyway, just open both files for input, use a single loop that reads a line from each file, tokenize the two corresponding lines into two arrays, then use a nested loop to compare the tokens. Nothing complicated about that.
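For what it's worth, the Perl approach described above could look roughly like the sketch below. It is only one possible reading of that description, not code from the thread: the sub names `tokens` and `compare_lines`, the command-line usage, and the UTF-8 input layer are all assumptions, and a single index loop over the two token arrays stands in for the nested loop. If a pair of lines has different token counts, only the common prefix is compared.

```perl
use strict;
use warnings;

# Split one line into word tokens, dropping whitespace and punctuation.
sub tokens {
    my ($line) = @_;
    return grep { length } split /[\s\p{P}]+/, $line;
}

# Compare two corresponding lines token by token.
# Returns (tokens compared, tokens that match exactly).
sub compare_lines {
    my ($out_line, $ref_line) = @_;
    my @out = tokens($out_line);
    my @ref = tokens($ref_line);
    # If the counts differ, compare only the common prefix.
    my $n = @out < @ref ? scalar(@out) : scalar(@ref);
    my $good = grep { $out[$_] eq $ref[$_] } 0 .. $n - 1;
    return ($n, $good);
}

# Hypothetical usage: perl compare.pl output correct
if (@ARGV == 2) {
    my ($out_name, $ref_name) = @ARGV;
    # Assumption: both files are UTF-8. Change the layer
    # (e.g. to iso-8859-2) if they are encoded differently.
    open my $out_fh, '<:encoding(UTF-8)', $out_name or die "$out_name: $!";
    open my $ref_fh, '<:encoding(UTF-8)', $ref_name or die "$ref_name: $!";

    my ($total, $good) = (0, 0);
    while (defined(my $out_line = <$out_fh>)) {
        my $ref_line = <$ref_fh>;
        last unless defined $ref_line;    # reference file is shorter
        my ($n, $g) = compare_lines($out_line, $ref_line);
        $total += $n;
        $good  += $g;
    }
    close $out_fh;
    close $ref_fh;
    printf "%d of %d tokens match\n", $good, $total;
}
```

Decoding the input matters here: if the lines are left as raw UTF-8 bytes, some continuation bytes can be mistaken for punctuation by `\p{P}` and words get split in the wrong places.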
