jajaja has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I want to put diacritics into a 30 MB text file, "input". I have a file "cetnosti" with the most used words, looking like this:
a 14458708
se 10848091
v 10688846
na 8721120
je 5353514
že 4991304
The left column is a word and the right column is its number of occurrences in some different text of the same language. I have a file "input" looking like this:
Je mi urcite cti, avsak predstavit pomerne strucne, a navic bez moznos praktickych ukazek, nas hlavni a nosny produkt, muze zpusobit I male komplikace. Proto prijmete prosim tento clanek jako snahu, poskytnout
and I need to replace the words in "input" with the highest-occurrence words from "cetnosti" and write the result to "output". The problem is that the file "cetnosti" is too big to read entirely into memory, so I read only the beginning of it, with the most used words:
use Tree::Trie;

$trie  = new Tree::Trie;
$filei = "cetnosti";
$filer = "input";
$filew = "output";

open(INFO, $filei) || die "error: couldnt open file: $!";
$lineno = 1;
while ((defined ($line = <INFO>)) && ($lineno < 100000)) {
    $line =~ s/\t.*//g;
    $line =~ s/\n//g;
    $trie->add($line);
    $lineno++;
}
close(INFO);
Now I was thinking about how to replace the words from "input" and write them to "output":
open(READ, $filer)      || die "error: couldnt open file: $!";
open(WRITE, "> $filew") || die "error: couldnt open file: $!";
while (defined ($line = <READ>)) {
    @words = split(/ /, $line);
    # here i'd like to compare each word from a line of "input" with
    # "cetnosti" and write it to "output", but i have no idea how to do it
}
close(READ);
close(WRITE);
Can anybody think of a good way to do this? Thank you for your help.

Re: fill diacritic into text
by Fletch (Bishop) on May 30, 2007 at 16:26 UTC
      This is also what I would suggest. You need some sort of database conversion so that you do not have to load everything into memory. Then all you need to do is write logic around the "input" word in some sort of query.

      SELECT MAX(fl.frequency), fl.replace_word
      FROM frequency_list fl
      INNER JOIN synonyms syn
              ON syn.wordid = fl.wordid
             AND syn.word = $myword

      This of course means you would have to build your database tables so that you could make use of this data. The query I just wrote is an example of what you COULD do if you had the data loaded into a database.
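
      For concreteness, here is a minimal sketch of how that might look with DBI and SQLite. The schema (frequency_list, synonyms) and the column names are only my guesses from the query above, not something the original poster already has:

      use strict;
      use warnings;
      use DBI;

      my $dbh = DBI->connect( "dbi:SQLite:dbname=words.db", "", "",
                              { RaiseError => 1 } );

      # Hypothetical schema inferred from the query above.
      $dbh->do(q{
          CREATE TABLE IF NOT EXISTS frequency_list (
              wordid       INTEGER PRIMARY KEY,
              replace_word TEXT,      -- the accented spelling
              frequency    INTEGER    -- its count from "cetnosti"
          )
      });
      $dbh->do(q{
          CREATE TABLE IF NOT EXISTS synonyms (
              wordid INTEGER,         -- points into frequency_list
              word   TEXT             -- the unaccented spelling seen in "input"
          )
      });

      # Find the most frequent accented candidate for one input word.
      # (SQLite lets a bare column ride along with MAX() and returns its
      # value from the row where the maximum occurs.)
      my $sth = $dbh->prepare(q{
          SELECT MAX(fl.frequency), fl.replace_word
          FROM frequency_list fl
          INNER JOIN synonyms syn ON syn.wordid = fl.wordid
          WHERE syn.word = ?
      });
      $sth->execute('ze');                  # example token from "input"
      my ( $freq, $replacement ) = $sth->fetchrow_array;
      print $replacement // 'ze', "\n";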

      One thing I do not understand about your exercise is how you are deciding which word is synonymous with another word. How do you relate a word + frequency count to a word in your text?

      There has to be a relationship established between the words, otherwise you have a meaningless list...
Re: fill diacritic into text
by graff (Chancellor) on May 30, 2007 at 18:23 UTC
    I'm a little confused about your task (maybe you are too?)...

    Your sample of "input" data contains no diacritics -- just plain (unaccented) ascii letters -- and your sample of "cetnosti" data has some words with diacritics (e.g. "že"). You seem to be saying: for some of the (unaccented) words that occur in "input", you want to replace them with (accented) words from "cetnosti". Is that right?

    I think the first thing you have to cover is how to relate accented letters to their unaccented (ascii) counterparts (e.g. where "cetnosti" has "že", "input" will have "ze"). Then you have to maintain a hash of words containing accented characters, keyed by their unaccented version -- that is:

    my %respell = ( 'ze' => "\x{017e}e", ... );
    (updated that to use correct quote marks)
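
    To make that decomposition step concrete, here is a quick check of what NFD does (my own example; it assumes a UTF-8 terminal):

    use strict;
    use warnings;
    use utf8;                        # this snippet itself contains "ž"
    use Unicode::Normalize;
    binmode STDOUT, ':utf8';

    my $word  = 'že';
    my $plain = NFD($word);          # 'ž' decomposes to 'z' + U+030C (combining caron)
    $plain =~ s/[^[:ascii:]]+//g;    # drop the combining mark
    print "$word -> $plain\n";       # prints: že -> ze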

    I don't think you need Tree::Trie for this. It looks like a great module, applying a prefix-lookup scheme that I've used myself on occasion, and I'm glad to know there's a name and a module for that approach -- but you don't need it for this application.

    You also don't need to read (and hold in memory) all the contents of your "cetnosti" file. You just need to keep the words that contain accented letters, stored in a hash keyed by the "unaccented" variant of the word. Something like this ought to work to load the hash (I'd read about Unicode::Normalize before, but now that I've tried it out, it's really cool):

    use Unicode::Normalize;

    my %respell;
    open( INFO, "<:utf8", "cetnosti" ) or die "cetnosti: $!";
    while (<INFO>) {
        next unless ( /[^[:ascii:]]/ );     # skip words that are all-ascii
        my ( $word, $freq ) = split;
        my $ascii_word = NFD( $word );      # break accented letters into letter, diacritic
        $ascii_word =~ s/[^[:ascii:]]+//g;  # delete diacritics
        $respell{$ascii_word} = $word;
    }
    close INFO;
    Assuming that works as intended, now you just need to go through your input file, tokenize it as needed, and check each token to see if it exists as a hash key in %respell. If so, replace the token with the value of that hash element:
    open( INPUT, "<:utf8", "input" ) or die "input: $!";
    open( OUTPUT, ">:utf8", "respelled" ) or die "respelled: $!";
    while (<INPUT>) {
        my $outstr = '';
        for my $tkn ( split /(\s+)/ ) {
            if ( exists( $respell{$tkn} )) {
                $tkn = $respell{$tkn};
            }
            $outstr .= $tkn;
        }
        print OUTPUT $outstr;
    }
    My code snippets have not done anything to handle upper vs. lower case in the input (or "cetnosti"), but you should be able to work that out. Also, if the "input" file has punctuation (e.g. "word, word. word?", etc.), you'll need to factor that into the split regex; /([\s\p{P}]+)/ would probably work for that.

    (Notice that I'm putting parens in the split regex -- that captures whatever character sequence makes up a token boundary, so that the whole string can easily be put back together with all the original token boundaries intact.)
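
    A tiny illustration of that round-trip property (my own example, not from the original post):

    my $line  = "Je mi urcite cti, avsak predstavit pomerne strucne.";
    my @parts = split /([\s\p{P}]+)/, $line;   # boundaries are captured as list elements
    print "round-trips ok\n" if join('', @parts) eq $line;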

    UPDATE: WARNING: This sort of token replacement will do serious damage when the language in question has sets of words that are distinguished only by accent marks -- e.g. I would not use this approach for Spanish, because there are many pairs of common words like "que" and "qué", where the accent difference is significant; the code shown above would obliterate it.

      Thank you a lot for this idea. It didn't work properly with the Normalize module, so I tried it this way... sorry for taking out strict :)
      $times = time;
      $filei = "cetnosti";
      $filer = "input";
      $filew = "output";
      $filec = "correct";

      open( INFO, $filei ) or die "cetnosti: $!";
      $lineno = 1;
      while ((defined ($_ = <INFO>)) && ($lineno < 500000)) {
          ( $word, $freq ) = split;
          $ascii_word = $word;
          $ascii_word =~ tr/ľščťžýáíéäúňôěřŕĺůóďĽŠČŤŽÝÁÍÉÄÚŇÔĚŘŔĹŮÓĎ/lsctzyaieaunoerrluodLSCTZYAIEAUNOERRLUOD/;
          $lineno++;
          if ( exists( $respell{$ascii_word} )) { next; }
          $respell{$ascii_word} = $word;
      }
      close INFO;

      open( INPUT, $filer ) or die "input: $!";
      open( OUTPUT, "> $filew" ) or die "respelled: $!";
      while (<INPUT>) {
          $outstr = '';
          for $tkn ( split /([\s\p{P}]+)/ ) {
              if ( exists( $respell{$tkn} )) {
                  $tkn = $respell{$tkn};
              }
              $outstr .= $tkn;
          }
          print OUTPUT $outstr;
      }
      close INPUT;
      close OUTPUT;

      $timee = time;
      $timer = $timee - $times;
      print "execution time: $timer seconds\n";
      I'm sure there are many beginner's mistakes in there, but it works :) Now I would need to compare "output" with "correct": "correct" is a file with diacritics, and I need to know how many words were replaced correctly. Is there some way to do this in Perl? Thank you.
        sorry for taking out strict :)

        Maybe you don't know yet how sorry you might be later. ;)

        Now I would need to compare "output" with "correct": "correct" is a file with diacritics, and I need to know how many words were replaced correctly. Is there some way to do this in Perl? Thank you.

        Presumably, the "correct" file and your "test output" file should have the same number of lines and the same number of word tokens. (The unix "wc" command would be good for confirming that -- if you have ms-windows with cygwin installed, "wc" comes with that; for any given input file, it reports the number of lines, words and bytes.)

        And if you have "wc", then you also have the unix "diff" command. No perl scripting necessary for this task. But if you wanted to write a perl script for it anyway, just open both files for input, use a single loop that will read a line from each file, tokenize the two corresponding lines into two arrays, then use a nested loop to compare the tokens. Nothing complicated about that.
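
        If you do want to script it, a minimal sketch might look like this (my own code; it assumes both files use the same tokenization, so corresponding words line up):

        use strict;
        use warnings;

        open( my $out, '<:utf8', 'output'  ) or die "output: $!";
        open( my $cor, '<:utf8', 'correct' ) or die "correct: $!";

        my ( $total, $good ) = ( 0, 0 );
        while ( defined( my $oline = <$out> ) ) {
            my $cline = <$cor> // last;    # stop if "correct" runs out of lines
            my @otkn = split ' ', $oline;
            my @ctkn = split ' ', $cline;
            for my $i ( 0 .. $#ctkn ) {    # inner loop: token-by-token comparison
                $total++;
                $good++ if defined $otkn[$i] && $otkn[$i] eq $ctkn[$i];
            }
        }
        printf "%d of %d words match (%.1f%%)\n", $good, $total, 100 * $good / $total
            if $total;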

Re: fill diacritic into text
by BrowserUk (Patriarch) on May 30, 2007 at 18:16 UTC

    How many items/lines are there in your "cetnosti" file?

    What properties of Tree::Trie are you using or intending to use that are beneficial over a standard hash? I ask because in a couple of inconclusive quick tests, Tree::Trie seems to use ~twice as much memory as a standard hash to store the same information.
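
    For reference, a quick test along those lines might look like this (my own sketch, using Devel::Size; absolute numbers will vary with the perl build):

    use strict;
    use warnings;
    use Devel::Size qw(total_size);
    use Tree::Trie;

    my @words = map { sprintf "word%06d", $_ } 1 .. 100_000;

    my %hash;
    $hash{$_} = undef for @words;

    my $trie = Tree::Trie->new;
    $trie->add(@words);    # add() accepts a list of words

    printf "hash: %10d bytes\n", total_size( \%hash );
    printf "trie: %10d bytes\n", total_size( $trie );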

    It also looks to me as if you are not using Tree::Trie correctly anyway. You appear to be running the key and value elements of the lookup items together into a single value and then storing them. If I understand your purpose correctly--which is not a given, as your description is far from clear--that means you will never find anything from your input file within the trie.

    All in all, I think you would do best to ask your question again. This time, post a few examples of the input data and the types of transformations that you are hoping to perform.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: fill diacritic into text
by ambrus (Abbot) on May 31, 2007 at 09:24 UTC

    Perhaps when reading the word frequency file, keep only those words that contain accented characters. That could save a bit of memory.

    You then have to build a hash with the unaccented variants of those words as keys and the original words as values. Then read the second file, look up each word in the hash, and replace it with the value if it exists. Take special care to preserve upper and lower case.
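
    One way that case handling could look (my own sketch; it assumes the %respell keys are stored lowercased, as with the single entry here):

    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':utf8';

    my %respell = ( 'ze' => "\x{017e}e" );    # keys stored lowercased

    my @tokens = ( 'ze', 'Ze', 'cti' );
    for my $tkn (@tokens) {
        my $key = lc $tkn;
        if ( exists $respell{$key} ) {
            my $word = $respell{$key};
            $word = ucfirst $word if $tkn =~ /^\p{Lu}/;   # token started uppercase
            $tkn = $word;
        }
        print "$tkn\n";    # prints: že, Že, cti
    }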

      Yes, you are right, it would save memory, but the success rate would be lower, because some words are more frequent in their unaccented form; if I read only the words that contain accented characters, I couldn't know which variant is usually more common.

        To tell the truth, I'm quite surprised that you have a word frequency file whose words don't fit in memory. But if this is really the case, you can do the following.

        First, transform the frequency file into another file by prefixing each line with the unaccented version of the word, while still keeping the accented version. You can do this easily without reading the whole file into memory. Then sort this file using the unaccented versions as the key. Then read the sorted file: because all the words for a given unaccented variant now come together, you can keep in memory only those words that are accented and do not have a higher-frequency unaccented variant.
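
        A sketch of that pipeline (my own code; unaccent() reuses the NFD stripping shown earlier in the thread, and the middle step shells out to the system sort):

        use strict;
        use warnings;
        use Unicode::Normalize;

        sub unaccent {
            my $w = NFD(shift);          # split letters from combining marks
            $w =~ s/[^[:ascii:]]+//g;    # drop the marks
            return $w;
        }

        # Step 1: prefix each line with its unaccented key
        # (one pass, constant memory).
        open my $in,  '<:utf8', 'cetnosti'       or die "cetnosti: $!";
        open my $out, '>:utf8', 'cetnosti.keyed' or die "cetnosti.keyed: $!";
        while (<$in>) {
            my ( $word, $freq ) = split;
            print {$out} join( ' ', unaccent($word), $word, $freq ), "\n";
        }
        close $in;
        close $out;

        # Step 2: external sort, outside perl:  sort cetnosti.keyed > cetnosti.sorted

        # Step 3: one pass over the sorted file; lines sharing an unaccented
        # key are now adjacent, so only one group's best candidate is tracked.
        open my $srt, '<:utf8', 'cetnosti.sorted' or die "cetnosti.sorted: $!";
        my %respell;
        my ( $curkey, $best_word, $best_freq ) = ( '', '', -1 );
        while (<$srt>) {
            my ( $key, $word, $freq ) = split;
            if ( $key ne $curkey ) {
                # close out the previous group: keep it only if the winner is accented
                $respell{$curkey} = $best_word
                    if $best_freq >= 0 && $best_word ne $curkey;
                ( $curkey, $best_word, $best_freq ) = ( $key, $word, $freq );
            }
            elsif ( $freq > $best_freq ) {
                ( $best_word, $best_freq ) = ( $word, $freq );
            }
        }
        $respell{$curkey} = $best_word if $best_freq >= 0 && $best_word ne $curkey;
        close $srt;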