Getting around "/" as a word boundary

sherab has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I am running into a strange regex issue. I have a document in which I am doing replacements. As an example, I want to replace "DEXX" with "DEXX/AREX" and then, with the next substitution, replace "AREX" with "AREX/CUBE".

DEXX and AREX are stored in a hash like so: "DEXX" => "AREX", "AREX" => "CUBE"

The regex I have is this:

foreach (keys %hashstore){
    $doc=~s!\b($_)\b!$1/$hashstore{$_}!ig;
}

What's happening is that "DEXX" is correctly replaced with "DEXX/AREX", but when "DEXX/AREX" is then encountered, the regex replaces it with "DEXX/AREX/CUBE". It should only replace "AREX" when it finds it as a standalone word, not as part of another combination like "DEXX/AREX".

I thought that a negative lookbehind might work, i.e.:

foreach (keys %hashstore){
    $doc=~s#(?<!/)\b($_)\b#$1/$hashstore{$_}#ig;
}

No luck though

It seems to detect "/" as a word boundary. Has anyone encountered this or know of a fix around it? Many thanks!

Re: Getting around "/" as a word boundary by ikegami (Patriarch) on Aug 12, 2010 at 04:22 UTC
The problem has nothing to do with word boundaries. The problem is that you're replacing your replacements. Fix:

my $pat = join '|', keys(%hashstore);
$doc =~ s!\b($pat)\b!$1/$hashstore{uc($1)}!ig;

You're wrong about the negative lookbehind not working.

my %hashstore = ( "DEXX" => "AREX", "AREX" => "CUBE" );

my $doc1 = "DEXX";
my $doc2 = "DEXX";
my $doc3 = "DEXX";

#foreach (keys %hashstore){
foreach ("DEXX", "AREX"){  # Make sure we get them in the worst order.
    $doc1=~s#\b($_)\b#$1/$hashstore{uc($_)}#ig;
}

#foreach (keys %hashstore){
foreach ("DEXX", "AREX"){  # Make sure we get them in the worst order.
    $doc2=~s#(?<!/)\b($_)\b#$1/$hashstore{uc($_)}#ig;
}

my $pat = join '|', keys(%hashstore);
$doc3 =~ s!\b($pat)\b!$1/$hashstore{uc($1)}!ig;

print("$doc1\n");  # DEXX/AREX/CUBE
print("$doc2\n");  # DEXX/AREX
print("$doc3\n");  # DEXX/AREX

Note the addition of `uc()`. Without it, you'd match stuff you wouldn't find in the hash because of /i.
Re^2: Getting around "/" as a word boundary by AnomalousMonk (Archbishop) on Aug 12, 2010 at 05:01 UTC
The use of `|` (ordered alternation) in the regex introduces a subtlety: Perl's implementation of this regex operator finds the first possible match in the alternation regardless of match length. Since the order of strings returned from keys is essentially random, this may not be what you want. The use of `\b` word boundaries and look-behind avoids the problem in the particular example given in Re: Getting around "/" as a word boundary, but this may not always be available.

Usually, the longest match is needed. Sorting (in default order) and then reversing the order of sorted keys in the replacement hash produces the longest match: 'ABC', 'ABCD', 'ABCDE' (in any order) becomes 'ABCDE', 'ABCD', 'ABC'. E.g. (upper/lower case issues ignored):

>perl -wMstrict -le
"my %replace = (
   DEXX => 'AREX', AREX => 'CUBE',
   ABC => 'VWX', ABCD => 'VWXY', ABCDE => 'VWXYZ',
   );
 my $find = join '|', map quotemeta, keys %replace;
 $find = qr{ $find }xms;
 print qq{find regex: $find};
 my $s = 'DEXX AREX CUBE ABC ABCD ABCDE';
 print qq{before: '$s'};
 (my $t = $s) =~ s{ ($find) }{$1/$replace{$1}}xmsg;
 print qq{after: '$t'};
 print '';
 my $longest = join '|', map quotemeta, reverse sort keys %replace;
 $longest = qr{ $longest }xms;
 print qq{find regex (longest match): $longest};
 print qq{before: '$s'};
 ($t = $s) =~ s{ ($longest) }{$1/$replace{$1}}xmsg;
 print qq{after: '$t'};
"
find regex: (?msx-i: DEXX|ABC|ABCD|ABCDE|AREX )
before: 'DEXX AREX CUBE ABC ABCD ABCDE'
after: 'DEXX/AREX AREX/CUBE CUBE ABC/VWX ABC/VWXD ABC/VWXDE'

find regex (longest match): (?msx-i: DEXX|AREX|ABCDE|ABCD|ABC )
before: 'DEXX AREX CUBE ABC ABCD ABCDE'
after: 'DEXX/AREX AREX/CUBE CUBE ABC/VWX ABCD/VWXY ABCDE/VWXYZ'

Updates: The example given above implies misleadingly that use of a properly ordered alternation alone is sufficient, that `\b` word boundaries are not needed in the case given in the OP. Not (necessarily) so:

>perl -wMstrict -le
"my %replace = (
   ABC => 'XXX', ABCD => 'YYYY', ABCDE => 'ZZZZZ',
   );
 my $find = join '|', map quotemeta, reverse sort keys %replace;
 $find = qr{ $find }xms;
 print qq{find regex: $find};
 my $s = 'ABC ABCD xxABCDxx ABCDE';
 print qq{before: '$s'};
 (my $t = $s) =~ s{ ($find) }{$replace{$1}}xmsg;
 print qq{sans \\b: '$t'};
 print '';
 print qq{before: '$s'};
 ($t = $s) =~ s{ \b ($find) \b }{$replace{$1}}xmsg;
 print qq{with \\b: '$t'};
"
find regex: (?msx-i: ABCDE|ABCD|ABC )
before: 'ABC ABCD xxABCDxx ABCDE'
sans \b: 'XXX YYYY xxYYYYxx ZZZZZ'

before: 'ABC ABCD xxABCDxx ABCDE'
with \b: 'XXX YYYY xxABCDxx ZZZZZ'

See discussion of alternation in perlre and perlretut.
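An editorial sketch, not posted by anyone in the thread: the points above (single pass, escaped keys, longest alternative first, `\b` anchors) can be combined in one small routine. The name `replace_all` is hypothetical, and sorting by key length is an assumed alternative to the `reverse sort` used above; it is meant as a safeguard for patterns where `\b` cannot be used.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): build one alternation from the
# hash keys, longest key first, and replace every key in a single pass so
# that text inserted by one replacement is never scanned again.
sub replace_all {
    my ($text, $replace) = @_;
    my $alt = join '|',
        map  { quotemeta($_) }                               # escape regex metacharacters
        sort { length($b) <=> length($a) or $a cmp $b }      # longest first, then alphabetical
        keys %$replace;
    $text =~ s{\b($alt)\b}{$1/$replace->{$1}}g;
    return $text;
}

my %replace = (DEXX => 'AREX', AREX => 'CUBE', ABC => 'VWX', ABCD => 'VWXY');
print replace_all('DEXX AREX CUBE ABC ABCD', \%replace), "\n";
# DEXX/AREX AREX/CUBE CUBE ABC/VWX ABCD/VWXY
```

Like the examples above, this ignores upper/lower case; adding /i would also require normalizing the hash lookup, e.g. with `uc($1)`.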
Re^2: Getting around "/" as a word boundary by renshui (Novice) on Aug 12, 2010 at 08:34 UTC
Hi ikegami, I am a Perl beginner and I don't fully understand the following regex. Can you explain the (?<!/) part in detail? Thanks in advance. `$doc2=~s#(?<!/)\b($_)\b#$1/$hashstore{uc($_)}#ig;`
Re^3: Getting around "/" as a word boundary by AnomalousMonk (Archbishop) on Aug 12, 2010 at 08:48 UTC
This is a "zero-width, negative look-behind assertion". The assertion `(?<!/)` is true if the point at which it occurs in the regex is not immediately preceded by a '/' (forward slash) character. See `(?<!pattern)` in the Extended Patterns section (in the Look-Around Assertions subsection) of perlre. Update: Added "zero-width" to description.
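A minimal sketch of that assertion in action (an editorial illustration, not code from the thread; the helper name `tag_arex` and the sample string are made up). It runs the same substitution with and without `(?<!/)` so the difference is visible:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): append "/CUBE" to each
# standalone AREX, optionally skipping any AREX that directly follows a slash.
sub tag_arex {
    my ($text, $use_lookbehind) = @_;
    if ($use_lookbehind) {
        # (?<!/) consumes no characters; it only checks what precedes the match.
        $text =~ s{(?<!/)\b(AREX)\b}{$1/CUBE}g;
    }
    else {
        $text =~ s{\b(AREX)\b}{$1/CUBE}g;
    }
    return $text;
}

my $doc = 'DEXX/AREX and AREX';
print tag_arex($doc, 0), "\n";   # DEXX/AREX/CUBE and AREX/CUBE
print tag_arex($doc, 1), "\n";   # DEXX/AREX and AREX/CUBE
```

Because the assertion is zero-width, the slash itself is never part of the match, so nothing before AREX is altered by the replacement.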
Re^3: Getting around "/" as a word boundary by ikegami (Patriarch) on Aug 12, 2010 at 16:14 UTC
"`(?<!/)`" means "not immediately preceded by '`/`'"

