Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I am writing a Perl script that replaces words or phrases in a text with their translations from a dictionary file.

For instance, the dictionary file looks like:

chinese	francais
restaurant	restaurant
chinese restaurant	restaurant francais

and my text like:

This is a chinese restaurant

What I want is to replace "chinese restaurant" as a whole with "restaurant francais", and not "chinese" and "restaurant" separately.

My code looks like:

#read file in hash
my %dictionary;
while (<DIC>) {
    my ( $key, $tgt ) = split(/\t/, $_);
    push @{ $dictionary{$key} }, $tgt;
}

#apply dic to txt file
while (<FILEIN>) {
    my $line = $_;
    foreach my $key (keys %dictionary) {
        #$line =~ s/%$src%/$dictionary{$key}/g;
    }
    print $line;
}

I would appreciate any help

Replies are listed 'Best First'.
Re: replace text using hash
by kennethk (Abbot) on Sep 08, 2014 at 17:10 UTC
    In addition to what LanX says, a couple comments:
    • You'll probably want to escape metacharacters either using quotemeta or wrapping your terms with \Q and \E.

    • You may also wish to wrap your terms with \b to indicate word boundaries.

    • I note that you have a term that contains whitespace. Are you confident that all your whitespace will be single spaces, as opposed to tabs or newlines?

    • One feature of LanX's solution that was not mentioned: because the substitution consumes the string as it matches, joining the keys with alternation (|) prevents double substitution.
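
    To illustrate the double-substitution point, a minimal sketch with a made-up two-entry dictionary (the %dictionary contents and the sample text are assumptions, not the OP's data). A per-key loop can translate the output of an earlier substitution again, while a single alternation pass cannot:

    ```perl
    use strict;
    use warnings;

    # Hypothetical dictionary in which one target is also a key.
    my %dictionary = (
        'chinese'  => 'francais',
        'francais' => 'italien',
    );

    my $text = 'a chinese restaurant';

    # Per-key loop: 'chinese' becomes 'francais', which the next key then
    # turns into 'italien'. The sort only makes this demonstration
    # deterministic; with plain "keys" the outcome depends on hash order.
    my $looped = $text;
    for my $key (sort keys %dictionary) {
        $looped =~ s/\Q$key\E/$dictionary{$key}/g;
    }

    # Single pass: matched text is consumed, so a replacement is never rescanned.
    my $re = join '|', map { quotemeta($_) }
        sort { length $b <=> length $a } keys %dictionary;
    (my $single = $text) =~ s/($re)/$dictionary{$1}/g;

    print "$looped\n";    # a italien restaurant
    print "$single\n";    # a francais restaurant
    ```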

    So, my solution (untested) would likely look like:
    my $re = join '|', map "\Q$_\E", sort {length $b <=> length $a} keys %dictionary;
    while (<FILEIN>) {
        s/($re)/$dictionary{$1}/g;
        print;
    }
    This will not fix the whitespace issue. To do that, you'd probably need to slurp the file, modify the regex to treat whitespace agnostically, and then replicate the type of whitespace in your result. Doing this is probably a giant pain.
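
    Putting the bullets together, here is a self-contained sketch rather than a tested drop-in. The dictionary entries (including the 'a.b' key) and the translate() helper are made up for illustration. Keys are escaped with quotemeta, sorted longest first, and wrapped in \b:

    ```perl
    use strict;
    use warnings;

    # Made-up dictionary; in the OP's script this would be read from the
    # tab-separated dictionary file.
    my %dictionary = (
        'chinese'            => 'francais',
        'chinese restaurant' => 'restaurant francais',
        'a.b'                => 'X',
    );

    # Longest keys first, metacharacters escaped, whole words only.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length $b <=> length $a }
        keys %dictionary;
    my $re = qr/\b($alternation)\b/;

    # Hypothetical helper: translate one string in a single pass.
    sub translate {
        my ($text) = @_;
        $text =~ s/$re/$dictionary{$1}/g;
        return $text;
    }

    print translate('This is a chinese restaurant'), "\n";  # This is a restaurant francais
    print translate('chinesestyle, axb, a.b'), "\n";        # chinesestyle, axb, X
    ```

    Note that \b only behaves as expected when a key starts and ends with a word character; keys such as "c++" would need a different boundary test.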

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

      > modify the regex to treat whitespace agnostically, and then replicate the type of whitespace in your result.

      OK I missed that part. But I think substituting each blank with a grouped (\s) should do.

      That way the position of each whitespace match is known and can be mapped back to a blank before doing the hash look-up.

      Obviously, any whitespace other than a blank must then be forbidden in the keys.

      And at this point it gets obvious that looking at CPAN for a ready to use module should be a good idea! :)

      update

      In hindsight that's BS. This kind of translation, between expressions of different lengths, can't preserve good formatting.

      It's better to hold each paragraph on one line in a normalized form, in which all non-blank whitespace has been eliminated, and to redo the formatting after the substitution.
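
      A rough sketch of that idea, assuming the paragraph is already in a string and that the core module Text::Wrap is acceptable for the re-formatting. The dictionary, the 30-column width and the translate_paragraph() helper are made up for illustration, and "normalized" is read here as "every whitespace run becomes one blank":

      ```perl
      use strict;
      use warnings;
      use Text::Wrap qw(wrap);

      # Made-up dictionary; keys use single blanks only.
      my %dictionary = ('chinese restaurant' => 'restaurant francais');
      my $re = join '|', map { quotemeta($_) }
          sort { length $b <=> length $a } keys %dictionary;

      # Hypothetical helper: normalize one paragraph, translate it, wrap it again.
      sub translate_paragraph {
          my ($paragraph, $columns) = @_;
          $paragraph =~ s/\s+/ /g;     # every whitespace run becomes one blank
          $paragraph =~ s/^ | $//g;    # trim both ends
          $paragraph =~ s/($re)/$dictionary{$1}/g;
          local $Text::Wrap::columns = $columns;
          return wrap('', '', $paragraph);
      }

      # The phrase is split across a line break in the input.
      my $input = "This is a chinese\nrestaurant in the old town.\n";
      print translate_paragraph($input, 30), "\n";
      ```

      Paragraphs could be read one at a time by setting $/ = "" (paragraph mode) before the read loop.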

      Cheers Rolf

      (addicted to the Perl Programming Language and ☆☆☆☆ :)

      > One feature of LanX's solution that was not mentioned

      true, I was too lazy to type more... ;-)

      > you'd probably need to slurp the file

      I think paragraphs should be enough. And either the /s or /m modifier should help with newline as whitespace. (still need a mnemonic but I think it was /m /s because it was counter-intuitive)

      update

      It seems \s already matches a newline as whitespace. The /s modifier only affects the match-all dot (.), letting it match newlines too.

      So replacing each blank in the keys with \s+ should be sufficient.
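
      A minimal sketch of the \s+ approach, with a made-up dictionary and hypothetical helpers key_to_pattern() and translate(). Because the matched text may contain tabs or newlines, it is squeezed back to single blanks before the hash look-up:

      ```perl
      use strict;
      use warnings;

      # Made-up dictionary; keys contain single blanks only.
      my %dictionary = (
          'chinese'            => 'francais',
          'chinese restaurant' => 'restaurant francais',
      );

      # Escape each word of a key separately and join the words with \s+,
      # so a blank in the key matches any run of whitespace in the text.
      sub key_to_pattern {
          my ($key) = @_;
          return join '\s+', map { quotemeta($_) } split / /, $key;
      }

      my $re = join '|',
          map  { key_to_pattern($_) }
          sort { length $b <=> length $a }
          keys %dictionary;

      # Hypothetical helper: normalize the match, then look it up.
      sub translate {
          my ($text) = @_;
          $text =~ s{($re)}{ my $key = join ' ', split ' ', $1; $dictionary{$key} }ge;
          return $text;
      }

      print translate("This is a chinese\n\trestaurant"), "\n";  # This is a restaurant francais
      ```

      The original whitespace inside a matched phrase is lost, which is the formatting problem discussed above.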

      Cheers Rolf


      Hi,

      Thank you for your replies. I have done it but the output is:

      This is a ARRAY(0xe5d6c0)

      Thanks

        See update of my first reply.

        The whole push is nonsense.
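
        For the record, a small sketch of where the ARRAY(0x...) output comes from. The two dictionary lines are made up, and the hash name %with_push is only there for comparison:

        ```perl
        use strict;
        use warnings;

        # Two made-up dictionary lines, tab-separated as in the OP's file.
        my @lines = (
            "chinese\tfrancais\n",
            "chinese restaurant\trestaurant francais\n",
        );

        my (%with_push, %dictionary);
        for my $line (@lines) {
            chomp $line;                         # otherwise the target keeps its newline
            my ($key, $tgt) = split /\t/, $line;
            push @{ $with_push{$key} }, $tgt;    # value is an array reference
            $dictionary{$key} = $tgt;            # value is the plain target string
        }

        print "$with_push{chinese}\n";     # interpolates the reference: ARRAY(0x...)
        print "$dictionary{chinese}\n";    # francais
        ```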

        Cheers Rolf


Re: replace text using hash
by LanX (Saint) on Sep 08, 2014 at 16:45 UTC
    Just had a similar problem.

    First you need to sort the keys descending by length.

    Then you need to build an "or'ed" pattern of those keys.

    That way the longest match has priority.

    Something like:

    $pattern = join "|", @sorted;
    $text =~ s/($pattern)/$hash{$1}/g;

    Untested typing into my mobile :)
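
    A small sketch of why the sort matters, using the OP's three entries; the translate_with() helper and the reversed key order are made up for comparison. Perl tries the alternatives from left to right at each position, so whichever key comes first in the pattern wins:

    ```perl
    use strict;
    use warnings;

    # Dictionary matching the OP's example.
    my %hash = (
        'chinese'            => 'francais',
        'restaurant'         => 'restaurant',
        'chinese restaurant' => 'restaurant francais',
    );

    # Hypothetical helper: build the alternation from the keys in the given
    # order and apply it once.
    sub translate_with {
        my ($text, @keys) = @_;
        my $pattern = join '|', map { quotemeta($_) } @keys;
        $text =~ s/($pattern)/$hash{$1}/g;
        return $text;
    }

    my @sorted   = sort { length $b <=> length $a } keys %hash;
    my @shortest = reverse @sorted;    # worst case, for comparison

    my $text = 'This is a chinese restaurant';
    print translate_with($text, @sorted),   "\n";  # This is a restaurant francais
    print translate_with($text, @shortest), "\n";  # This is a francais restaurant
    ```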

    update

    Your code has some issues. First, you push the target ($tgt) onto an array, which makes %dictionary a hash of arrays (HoA).

    Second, you are translating line by line, which won't catch expressions spanning line breaks.

    Cheers Rolf


Re: replace text using hash
by b4swine (Pilgrim) on Sep 09, 2014 at 01:28 UTC

    This is just the concept, not the code. It would make more sense for a large dictionary and a small paragraph. For example, if the dict hash has 100,000 entries, and the paragraph has 1000 words, you don't want to do 100,000 substitutions for the paragraph, or make a 100,000 word long regexp.

    If you want a pure hash look-up solution (where ordering the keys is not an option), here is one way to do it. When you see a three-word pattern in your dictionary, like "aa bb cc" -> "xyz", make three entries:

    $dict{"aa bb cc"} = "xyz";
    $cont{"aa bb"} = 1;
    $cont{"aa"} = 1;

    Now, when it comes time to translate, read the file word by word. If the next words are "pp qq rr ss...", look up $cont{"pp"}; if it is 1, look up $cont{"pp qq"}, and so on, adding words until you have a phrase that is not in %cont. Now look for this phrase in %dict, and if it is not found, drop the rightmost word and retry.
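
    One possible reading of the concept above as code, not a tuned implementation. The helper names add_entry() and translate() are made up, words are assumed to be separated by whitespace, and the output is re-joined with single blanks:

    ```perl
    use strict;
    use warnings;

    my (%dict, %cont);

    # Hypothetical helper: register a phrase and mark each proper prefix in %cont.
    sub add_entry {
        my ($phrase, $target) = @_;
        $dict{$phrase} = $target;
        my @words = split ' ', $phrase;
        pop @words;
        while (@words) {
            $cont{"@words"} = 1;
            pop @words;
        }
    }

    # Hypothetical helper: translate a string word by word, longest phrase first.
    sub translate {
        my ($text) = @_;
        my @words = split ' ', $text;
        my @out;
        my $i = 0;
        while ($i < @words) {
            # Extend the candidate while it is a known prefix of a longer phrase.
            my $len = 1;
            $len++ while $i + $len < @words
                && $cont{ join ' ', @words[$i .. $i + $len - 1] };
            # Back off one word at a time until the candidate is in %dict.
            $len-- while $len > 0
                && !exists $dict{ join ' ', @words[$i .. $i + $len - 1] };
            if ($len > 0) {
                push @out, $dict{ join ' ', @words[$i .. $i + $len - 1] };
                $i += $len;
            }
            else {
                push @out, $words[$i];    # no entry starts here; copy the word
                $i++;
            }
        }
        return join ' ', @out;
    }

    add_entry('chinese',            'francais');
    add_entry('restaurant',         'restaurant');
    add_entry('chinese restaurant', 'restaurant francais');
    add_entry('aa bb cc',           'xyz');

    print translate('This is a chinese restaurant'), "\n";  # This is a restaurant francais
    print translate('aa bb dd aa bb cc'), "\n";             # aa bb dd xyz
    ```

    Each position in the text costs a handful of hash look-ups, bounded by the longest phrase in the dictionary, regardless of how many entries the dictionary has.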