replace text using hash

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I am writing a Perl script that replaces words or phrases in a text with their counterparts from a dictionary file.

For instance, the dictionary file looks like:

chinese    francais
restaurant     restaurant
chinese restaurant    restaurant francais

and my text like:

This is a chinese restaurant

What I want is to replace the whole phrase chinese restaurant with restaurant francais, rather than replacing chinese and restaurant separately.

My code looks like:

#read file in hash
my %dictionary;

while (<DIC>) {
    my ( $key, $tgt ) = split(/\t/, $_);
    push @{ $dictionary{$key} }, $tgt;

}

#apply dic to txt file
while (<FILEIN>) {

my $line = $_;


    foreach my $key (keys %dictionary) {

            #$line =~ s/%$src%/$dictionary{$key}/g;

    }


print $line;

}

I would appreciate any help.


Replies are listed 'Best First'.
Re: replace text using hash by kennethk (Abbot) on Sep 08, 2014 at 17:10 UTC
In addition to what LanX says, a couple of comments:

You'll probably want to escape metacharacters, either using quotemeta or by wrapping your terms in \Q and \E. You may also wish to wrap your terms in `\b` to indicate word boundaries. I note that you have a term that contains whitespace. Are you confident that all your whitespace will be single spaces, as opposed to tabs or newlines?

One feature of LanX's solution that was not mentioned is that, because the substitution consumes the string as it goes, joining the keys into a single alternation prevents double substitution: text that has already been replaced is never scanned again.

So, my solution (untested) would likely look like: `my $re = join '|', map "\Q$_\E", sort { length $b <=> length $a } keys %dictionary; while (<FILEIN>) { s/($re)/$dictionary{$1}/g; print }`

This will not fix the whitespace issue. To do that, you'd probably need to slurp the file, modify the regex to treat whitespace agnostically, and then replicate the type of whitespace in your result. Doing this is probably a giant pain.
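To make the suggestion above concrete, here is a minimal sketch that combines the three ideas from this reply: longest keys first, `quotemeta` on every key, and `\b` around the alternation. The in-memory `%dictionary` and the `translate` helper are assumptions made for illustration, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical in-memory dictionary standing in for the tab-separated file.
my %dictionary = (
    'chinese'            => 'francais',
    'restaurant'         => 'restaurant',
    'chinese restaurant' => 'restaurant francais',
);

# Longest keys first, so a phrase wins over the single words it contains.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a } keys %dictionary;

# \b keeps a key from matching inside a longer word such as "restaurants".
my $re = qr/\b($alternation)\b/;

# Illustrative helper: apply every dictionary entry in a single pass.
sub translate {
    my ($text) = @_;
    $text =~ s/$re/$dictionary{$1}/g;
    return $text;
}

print translate("This is a chinese restaurant"), "\n";
```

Because the regex engine tries the alternatives from left to right at each position, putting `chinese restaurant` ahead of `chinese` is what gives the phrase priority. The sketch still assumes single spaces inside keys and text.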
Re^2: replace text using hash by LanX (Saint) on Sep 08, 2014 at 18:10 UTC
> modify the regex to treat whitespace agnostically, and then replicate the type of whitespace in your result.

OK, I missed that part. But I think replacing each blank in the keys with a grouped `(\s)` should do. That way the position of each whitespace match is known, and the match can be mapped back to blanks before doing the hash lookup. Evidently, any whitespace other than a blank must be forbidden in the keys. At this point it becomes obvious that looking on CPAN for a ready-to-use module would be a good idea! :)

Update: In hindsight that's BS. This kind of translation between expressions of different lengths can't reproduce good formatting. It's better to hold each paragraph on one line in a normalized form, where each non-blank whitespace is eliminated, and to do the formatting again after the substitution.

Cheers Rolf (addicted to the Perl Programming Language and ☆☆☆☆ :)
Re^2: replace text using hash by LanX (Saint) on Sep 08, 2014 at 17:29 UTC
> One feature of LanX's solution that was not mentioned

True, I was too lazy to type more... ;-)

> you'd probably need to slurp the file

I think paragraphs should be enough. And either the `/s` or the `/m` modifier should help with treating newlines as whitespace. (I still need a mnemonic, but I think it was ~~`/m`~~ `/s` because it was counter-intuitive.)

Update: It seems `\s` already matches a newline as whitespace. The `/s` modifier only affects the match-all dot `.`, letting it match newlines. So replacing all whitespace in the keys with `\s+` should be sufficient.

Cheers Rolf
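A minimal sketch of the `\s+` idea, under stated assumptions: the in-memory `%dictionary` and the helpers `normalize_ws` and `translate` are hypothetical names, and the text is assumed to be handed over one paragraph at a time (for example by reading with `local $/ = '';`). Since the matched text may contain tabs or newlines, it is collapsed to single blanks before the hash lookup.

```perl
use strict;
use warnings;

# Hypothetical in-memory dictionary; keys use single blanks only.
my %dictionary = (
    'chinese'            => 'francais',
    'chinese restaurant' => 'restaurant francais',
);

# One alternative per key: escape each word, then allow any run of
# whitespace (blank, tab, newline) between the words.
my $alternation = join '|',
    map  { join '\s+', map { quotemeta($_) } split ' ', $_ }
    sort { length $b <=> length $a } keys %dictionary;
my $re = qr/\b($alternation)\b/;

# Collapse every run of whitespace to one blank so the match equals a key.
sub normalize_ws {
    my ($s) = @_;
    return join ' ', split ' ', $s;
}

# Illustrative helper: translate one paragraph of text.
sub translate {
    my ($text) = @_;
    $text =~ s/$re/$dictionary{ normalize_ws($1) }/ge;
    return $text;
}

print translate("This is a chinese\nrestaurant"), "\n";
```

The line break inside a matched phrase is lost in the output, so the sketch does not preserve the original formatting.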
Re^2: replace text using hash by Anonymous Monk on Sep 09, 2014 at 13:33 UTC
Hi, thank you for your replies. I have done it, but the output is: `This is a ARRAY(0xe5d6c0)` Thanks
Re^3: replace text using hash by LanX (Saint) on Sep 09, 2014 at 13:57 UTC
See the update to my first reply. The whole `push` is nonsense. Cheers Rolf
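The `ARRAY(0xe5d6c0)` output comes from `push @{ $dictionary{$key} }, $tgt`, which stores an array reference under each key, so interpolating `$dictionary{$key}` prints the reference instead of the translation. A minimal sketch of loading the file into a flat hash follows. The helper name `load_dictionary` and the in-memory sample data are assumptions for illustration, and the file is assumed to be tab-separated, as the `split(/\t/, ...)` in the question implies.

```perl
use strict;
use warnings;

# Hypothetical helper: read "source<TAB>target" lines into a flat hash.
sub load_dictionary {
    my ($fh) = @_;
    my %dictionary;
    while ( my $line = <$fh> ) {
        chomp $line;                       # drop the newline so it does not end up in the target
        next unless $line =~ /\S/;         # skip blank lines
        my ( $key, $tgt ) = split /\t/, $line, 2;
        next unless defined $tgt;          # skip lines without a tab
        $dictionary{$key} = $tgt;          # a plain string, not an array reference
    }
    return %dictionary;
}

# An in-memory filehandle stands in for the real dictionary file.
my $data = "chinese\tfrancais\nchinese restaurant\trestaurant francais\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my %dictionary = load_dictionary($fh);
close $fh;

print "$dictionary{'chinese restaurant'}\n";
```

The `chomp` matters as well: without it, every target string keeps its trailing newline, which would then be inserted into the translated text.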
Re^4: replace text using hash by Anonymous Monk on Sep 09, 2014 at 14:12 UTC
Re^5: replace text using hash by kennethk (Abbot) on Sep 09, 2014 at 15:15 UTC
Re^5: replace text using hash by LanX (Saint) on Sep 09, 2014 at 15:34 UTC
Re: replace text using hash by LanX (Saint) on Sep 08, 2014 at 16:45 UTC
Just had a similar problem. First you need to sort the keys by descending length. Then you need to build an "or'ed" pattern of those keys. That way the longest match has priority. Something like: `$pattern = join '|', @sorted; $text =~ s/($pattern)/$hash{$1}/g;`

Untested, typed on my mobile :)

Update: Your code has some issues. First, you push the tgt (?) into a HoA (hash of arrays). Then you are translating line by line, which won't catch expressions spanning line breaks.

Cheers Rolf
Re: replace text using hash by b4swine (Pilgrim) on Sep 09, 2014 at 01:28 UTC
This is just the concept, not the code. It would make more sense for a large dictionary and a small paragraph. For example, if the dict hash has 100,000 entries and the paragraph has 1,000 words, you don't want to do 100,000 substitutions for the paragraph, or build a regexp with 100,000 alternatives.

If you want to use a hash solution (which eliminates the possibility of ordering), here is one way to do it. When you see a three-word pattern in your dictionary, like "`aa bb cc`" -> "`xyz`", make three entries: `$dict{"aa bb cc"} = "xyz"; $cont{"aa bb"} = 1; $cont{"aa"} = 1;`

Now, when it comes time to translate, read the file word by word. If the next words are "`pp qq rr ss`...", look up `$cont{"pp"}`; if it is `1`, look up `$cont{"pp qq"}`, adding words until you have a phrase that is not in `%cont`. Now look for this phrase in `%dict`, and if it is not found, drop the rightmost word and retry.
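The concept above can be sketched roughly as follows. This is one possible reading of the description, not the poster's code: the helpers `phrase` and `translate` are hypothetical names, the text is split on whitespace only (so punctuation stays attached to words), and the output is rejoined with single blanks.

```perl
use strict;
use warnings;

# Hypothetical in-memory dictionary.
my %dict = (
    'chinese'            => 'francais',
    'restaurant'         => 'restaurant',
    'chinese restaurant' => 'restaurant francais',
);

# %cont marks every proper prefix of a multi-word key,
# e.g. "aa bb cc" yields $cont{"aa bb"} and $cont{"aa"}.
my %cont;
for my $key ( keys %dict ) {
    my @w = split ' ', $key;
    pop @w;
    while (@w) {
        $cont{"@w"} = 1;
        pop @w;
    }
}

# Join $len words starting at index $start into one phrase.
sub phrase {
    my ( $words, $start, $len ) = @_;
    return join ' ', @{$words}[ $start .. $start + $len - 1 ];
}

sub translate {
    my ($text) = @_;
    my @words = split ' ', $text;
    my @out;
    my $i = 0;
    while ( $i < @words ) {
        # Extend the candidate while it is a known prefix and words remain.
        my $len = 1;
        $len++ while $i + $len < @words && $cont{ phrase( \@words, $i, $len ) };

        # Drop the rightmost word until the phrase is a dictionary key.
        $len-- while $len > 0 && !exists $dict{ phrase( \@words, $i, $len ) };

        if ( $len > 0 ) {
            push @out, $dict{ phrase( \@words, $i, $len ) };
            $i += $len;
        }
        else {
            push @out, $words[$i];    # no entry starts here; keep the word
            $i++;
        }
    }
    return join ' ', @out;
}

print translate("This is a chinese restaurant"), "\n";
```

Each position in the text costs only a few hash lookups, bounded by the length of the longest key, which is what makes this attractive for a very large dictionary.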
