
This approach has a subtle problem. The (?<!\w) and (?=\s|$) look-around assertions (see Look-Around Assertions) used as delimiters are ambiguous: one of the "words" in the dictionary ('word word') contains a space, and a space satisfies both delimiters.

Perl regex alternation is "ordered": the first alternative that lets the overall pattern match wins, not the longest one. The sort used to generate the regex in the code above is ascending, so a shorter string sorts before any longer string that begins with the same characters. Because the delimiters are ambiguous, this produces a "shortest first" match in the alternation. If the word 'word' is added to the translation dictionary alongside the ambiguous 'word word' entry, a mistranslation occurs.

This can be fixed by disambiguating the delimiters. Another fix is to build the alternation so that a "longest first" match is performed. In the example below, $rx_A is built from an ascending sort and produces a mistranslation. $rx_D is built from a descending sort that produces a longest-first match and translates properly. (If such a "longest" alternation is used, delimiters can often be dispensed with entirely. In general, I prefer to build "longest" alternations from lists into which ambiguous strings may creep.)

c:\@Work\Perl\monks>perl -wMstrict -le
"use Data::Dumper;
 ;;
 use constant SENTENCE => 'this word word (word) is a word';
 ;;
 my %dic = (
   'word word' => 'parola parola',
   'word'      => 'XXXX',
   '(word)'    => '(parola)',
   );
 print Dumper \%dic;
 ;;
 print '---------------';
 ;;
 my ($rx_A) =
   map  qr{ (?<!\w) (?: $_) (?=\s|$) }xms,
   join ' | ',
   map  quotemeta,
   sort { length $a <=> length $b }
   keys %dic
   ;
 print qq{rx_A: $rx_A};
 ;;
 my $s = SENTENCE;
 print qq{'$s'};
 $s =~ s/($rx_A)/$dic{$1}/g;
 print qq{'$s'};
 ;;
 print '---------------';
 ;;
 my ($rx_D) =
   map  qr{ (?<!\w) (?: $_) (?=\s|$) }xms,
   join ' | ',
   map  quotemeta,
   reverse sort { length $a <=> length $b }
   keys %dic
   ;
 print qq{rx_D: $rx_D};
 ;;
 $s = SENTENCE;
 print qq{'$s'};
 $s =~ s/($rx_D)/$dic{$1}/g;
 print qq{'$s'};
"
$VAR1 = {
          '(word)' => '(parola)',
          'word' => 'XXXX',
          'word word' => 'parola parola'
        };

---------------
rx_A: (?^msx: (?<!\w) (?: word | \(word\) | word\ word) (?=\s|$) )
'this word word (word) is a word'
'this XXXX XXXX (parola) is a XXXX'
---------------
rx_D: (?^msx: (?<!\w) (?: word\ word | \(word\) | word) (?=\s|$) )
'this word word (word) is a word'
'this parola parola (parola) is a XXXX'

See sort for other ways to produce ascending versus descending sorting.
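
As a minimal sketch of one such alternative (the list and variable names here are illustrative only): swapping $a and $b in the comparison block gives a descending sort directly, without reverse, and a secondary cmp makes the order of equal-length strings deterministic.

```perl
use strict;
use warnings;

# Illustrative word list; 'ward' is added only to show the tie-breaker.
my @words = ('word', '(word)', 'word word', 'ward');

# Ascending by length (shortest first), as used for $rx_A above.
my @ascending = sort { length $a <=> length $b } @words;

# Descending by length: swap $a and $b instead of calling reverse.
# The "or $a cmp $b" part orders equal-length strings lexically.
my @descending = sort { length $b <=> length $a or $a cmp $b } @words;

print join(' | ', @descending), "\n";   # word word | (word) | ward | word
```

The tie-breaker does not affect which alternative matches when the strings differ, but it keeps the generated regex the same from run to run, since keys returns hash keys in no particular order.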

Update: Here's an example where the (reversed) default lexical sort alone is sufficient to produce proper, longest-first translation entirely without delimiters:

c:\@Work\Perl\monks>perl -wMstrict -le
"my %dic = qw(Abc Zyx  Abcd Zyxw  Abcde Zyxwv);
 ;;
 use constant S => 'AbcAbcdAbcdeAbcdeAbcdAbc';
 ;;
 print '------------';
 my ($rx_A) =
   map  qr{ $_ }xms,
   join ' | ',
   sort
   map  quotemeta,
   keys %dic
   ;
 print qq{rx_A: $rx_A};
 ;;
 my $s = S;
 print qq{'$s'};
 $s =~ s{ ($rx_A) }{$dic{$1}}xmsg;
 print qq{'$s'};
 ;;
 print '------------';
 my ($rx_D) =
   map  qr{ $_ }xms,
   join ' | ',
   reverse sort
   map  quotemeta,
   keys %dic
   ;
 print qq{rx_D: $rx_D};
 ;;
 $s = S;
 print qq{'$s'};
 $s =~ s{ ($rx_D) }{$dic{$1}}xmsg;
 print qq{'$s'};
"
------------
rx_A: (?^msx: Abc | Abcd | Abcde )
'AbcAbcdAbcdeAbcdeAbcdAbc'
'ZyxZyxdZyxdeZyxdeZyxdZyx'
------------
rx_D: (?^msx: Abcde | Abcd | Abc )
'AbcAbcdAbcdeAbcdeAbcdAbc'
'ZyxZyxwZyxwvZyxwvZyxwZyx'
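
Both examples follow the same recipe, which might be wrapped in a small helper. This is only a sketch: the sub name longest_first_rx is hypothetical, and it assumes the no-delimiter variant, so it is appropriate only when a dictionary key cannot turn up inside a longer word that should be left alone.

```perl
use strict;
use warnings;

# Hypothetical helper: build one alternation from the hash keys,
# longest key first, so the ordered alternation tries the longest
# candidate before any shorter string that is a prefix of it.
sub longest_first_rx {
    my ($dic) = @_;
    my $alternation =
        join '|',
        map  { quotemeta($_) }
        sort { length $b <=> length $a or $a cmp $b }
        keys %$dic;
    return qr{$alternation};
}

my %dic = (
    'word word' => 'parola parola',
    'word'      => 'XXXX',
    '(word)'    => '(parola)',
);

my $rx = longest_first_rx(\%dic);

(my $s = 'this word word (word) is a word') =~ s/($rx)/$dic{$1}/g;
print "'$s'\n";   # 'this parola parola (parola) is a XXXX'
```

Sorting on the raw keys before quotemeta keeps the length comparison honest, because the backslashes added by quotemeta would otherwise count toward a key's length.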

Give a man a fish: <%-{-{-{-<

In reply to Re^2: Symbols in regex by AnomalousMonk
in thread Symbols in regex by Anonymous Monk
