Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

hi,

I have a perl script to replace words or small phrases in a text file with a list that is saved to hash. I realized that when a phrase contains some symbols (i.e. #,.,(,), etc) the replacement is not performed.

Can you please have a look and let me know what's going wrong?

Here are the commands I use:

my $regex = join '|', map "\Q$_\E", sort {length $columnA <=> length $ +columnB} keys %dic; s/\b($regex)\b/$dic{$1}/g;

Thank you in advance for your time.

Replies are listed 'Best First'.
Re: Symbols in regex
by Preceptor (Deacon) on Nov 17, 2015 at 11:42 UTC

    Your problem is this:

    sort {length $columnA <=> length $columnB}

    That's not how sort works - it uses $a and $b to compare. Try instead:

    sort {length $a <=> length $b}

    Also - turn on use strict; use warnings; because this would have told you that:

    Global symbol "$columnA" requires explicit package name (did you forge +t to declare "my $columnA"?)

    The other gotcha is your use of \b - this is a word boundary, but it _doesn't_ match when you've got a symbol:

    print "matches first" if "a (word) in context " =~ m/\b\(word\)\b/; print "matches second" if "a (word) in context " =~ m/\(word\)/;

    First doesn't match, because space-to-bracket isn't a word boundary: A word boundary (\b ) is a spot between two characters that has a \w on one side of it and a \W on the other side of it (in either order), - both space and bracket are \W.

    So you can't do it like that - either you've got to skip the word boundary element, decide how much you like whitespace, or use a lookaround. Or just don't worry about it and allow substring matching. Depends a bit on what data you've got to worry about something like:

    print "matches third" if "a (word) in context " =~ m/(?<=\s)\(word\)(? +=\s)/;

    Uses lookbehind/lookahead to detect spaces. (You might need to handle start/end of line too) - so a 'not word' negative assertion might be required instead:

    print "matches fourth" if "a a in context (word) " =~ m/(?<!\w)\(word\ +)(?=\s|$)/;

    This should match (word) as long as it isn't immediately preceeded by a alpha-numeric (so is ok with start of line, where (?<=\W) wouldn't be) and a trailing space or end-of-line. If your case is more complicated that that (e.g you need to _not_ match ((word) here) then this should at least give you a start point.

    #!/usr/bin/env perl use strict; use warnings; use Data::Dumper; my %dic; while (<DATA>) { chomp; my($columnA, $columnB) = split("\t", $_); $dic{$columnA} = $columnB; } print Dumper \%dic; my $regex = join ( "|", map {quotemeta} sort { length $a <=> length $b + } keys %dic ); $regex = qr/(?<!\w)($regex)(?=\s|$)/; print $regex; my $sentence = "this word word (word) is a word\n"; $sentence =~ s/$regex/$dic{$1}/g; print $sentence; __DATA__ word word parola parola (word) (parola)

      This approach has a subtle problem. The  (?<!\w) and  (?=\s|$) look-around assertions (see Look-Around Assertions) used as delimiters are ambiguous. One of the "words" (i.e., 'word word') in the dictionary contains a space, which matches both of the delimiters.

      The Perl regex alternation is "ordered", i.e., the first match in the alternation is the overall match, not the longest match. The sorting used to generate the regex in the code above is ascending; shorter strings with a common initial group of characters will sort before longer strings. This produces a "shortest first" match in the alternation because the delimiters are ambiguous. If the word  'word' is introduced into the translation dictionary along with the ambiguous  'word word' "word", a mistranslation occurs.

      This can be fixed by disambiguating the delimiters. Another fix is to build the alternation so that a "longest first" match is performed. In the example below,  $rx_A is built from an ascending sort and produces a mistranslation.  $rx_D is built from a descending sort that produces a longest-first match and translates properly. (If such a "longest" alternation is used, delimiters can often be dispensed with entirely. In general, I prefer to build "longest" alternations from lists into which ambiguous strings may creep.)

      c:\@Work\Perl\monks>perl -wMstrict -le "use Data::Dumper; ;; use constant SENTENCE => 'this word word (word) is a word'; ;; my %dic = ( 'word word' => 'parola parola', 'word' => 'XXXX', '(word)' => '(parola)', ); print Dumper \%dic; ;; print '---------------'; ;; my ($rx_A) = map qr{ (?<!\w) (?: $_) (?=\s|$) }xms, join ' | ', map quotemeta, sort { length $a <=> length $b } keys %dic ; print qq{rx_A: $rx_A}; ;; my $s = SENTENCE; print qq{'$s'}; $s =~ s/($rx_A)/$dic{$1}/g; print qq{'$s'}; ;; print '---------------'; ;; my ($rx_D) = map qr{ (?<!\w) (?: $_) (?=\s|$) }xms, join ' | ', map quotemeta, reverse sort { length $a <=> length $b } keys %dic ; print qq{rx_D: $rx_D}; ;; $s = SENTENCE; print qq{'$s'}; $s =~ s/($rx_D)/$dic{$1}/g; print qq{'$s'}; " $VAR1 = { '(word)' => '(parola)', 'word' => 'XXXX', 'word word' => 'parola parola' }; --------------- rx_A: (?^msx: (?<!\w) (?: word | \(word\) | word\ word) (?=\s|$) ) 'this word word (word) is a word' 'this XXXX XXXX (parola) is a XXXX' --------------- rx_D: (?^msx: (?<!\w) (?: word\ word | \(word\) | word) (?=\s|$) ) 'this word word (word) is a word' 'this parola parola (parola) is a XXXX'

      See sort for other ways to produce ascending versus descending sorting.

      Update: Here's an example where the (reversed) default lexical sort alone is sufficient to produce proper, longest-first translation entirely without delimiters:

      c:\@Work\Perl\monks>perl -wMstrict -le "my %dic = qw(Abc Zyx Abcd Zyxw Abcde Zyxwv); ;; use constant S => 'AbcAbcdAbcdeAbcdeAbcdAbc'; ;; print '------------'; my ($rx_A) = map qr{ $_ }xms, join ' | ', sort map quotemeta, keys %dic ; print qq{rx_A: $rx_A}; ;; my $s = S; print qq{'$s'}; $s =~ s{ ($rx_A) }{$dic{$1}}xmsg; print qq{'$s'}; ;; print '------------'; my ($rx_D) = map qr{ $_ }xms, join ' | ', reverse sort map quotemeta, keys %dic ; print qq{rx_D: $rx_D}; ;; $s = S; print qq{'$s'}; $s =~ s{ ($rx_D) }{$dic{$1}}xmsg; print qq{'$s'}; " ------------ rx_A: (?^msx: Abc | Abcd | Abcde ) 'AbcAbcdAbcdeAbcdeAbcdAbc' 'ZyxZyxdZyxdeZyxdeZyxdZyx' ------------ rx_D: (?^msx: Abcde | Abcd | Abc ) 'AbcAbcdAbcdeAbcdeAbcdAbc' 'ZyxZyxwZyxwvZyxwvZyxwZyx'


      Give a man a fish:  <%-{-{-{-<

Re: Symbols in regex
by Eily (Monsignor) on Nov 17, 2015 at 10:57 UTC

    sort {length $columnA <=> length $columnB} keys %dic; where do $columnA and $columnB come from? If you print $regex; do you get the expected result (an actually sorted list of alternatives?).

    If you don't give us a fully working exemple we can only guess blindly. See How do I post a question effectively?

    And to give us your input data, you can do something like:

    use Data::Dumper; # Code to fill %dic print Dumper \%dic;

Re: Symbols in regex
by AnomalousMonk (Archbishop) on Nov 17, 2015 at 13:12 UTC

    Because you posted anonymously, you can't change a post; someone has to follow along behind you and clean up the mess. Please register as a user on PerlMonks. Posting as a registered, logged-in user allows you to edit a post. Please see How do I change/delete my post? for info on this, including proper etiquet for changing posts.


    Give a man a fish:  <%-{-{-{-<

Re: Symbols in regex
by Anonymous Monk on Nov 17, 2015 at 10:50 UTC
    well, %dic is empty :/ ... why?
Re: Symbols in regex
by Anonymous Monk on Nov 17, 2015 at 11:35 UTC

    Hi,

    Thanks for your reply!

    I am sorry that I didn't tell you that.

    my %dic; while (<DIC>) { chomp; my($columnA, $columnB) = split("\t", $_); $dic{$columnA} = $columnB; }

    The DIC file is a tab-delimeted file. For instance,

    word word parola parola (word) (parola)

    The input text looks like:

    This is a (word)

    and should change to:

    This is a (parola)
Re: Symbols in regex
by Anonymous Monk on Nov 17, 2015 at 11:20 UTC

    Hi,

    Thanks for your reply!

    I am sorry that I didn't tell you that.

    <my %dic; while (<DIC>) { chomp; my($columnA, $columnB) = split("\t", $_); $glossary{$columnA} = $columnB; }

    The DIC file is a tab-delimeted file. For instance,

    word word parola parola word parola