PerlMonks  

Getting around "/" as a word boundary

by sherab (Scribe)
on Aug 12, 2010 at 03:42 UTC ( [id://854567]=perlquestion )

sherab has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I am running into a strange regex issue. I have a document in which I am doing a series of replacements. As an example, I want to replace "DEXX" with "DEXX/AREX" and then, with the next substitution, replace "AREX" with "AREX/CUBE".

DEXX and AREX are stored in a hash like so: "DEXX" => "AREX", "AREX" => "CUBE"

The regex I have is this:

foreach (keys %hashstore){ $doc=~s!\b($_)\b!$1/$hashstore{$_}!ig; }

What's happening is that "DEXX" is replaced with "DEXX/AREX", which is fine, but when "DEXX/AREX" is then encountered, the regex turns it into "DEXX/AREX/CUBE". It should only replace "AREX" when it finds it as a standalone word, not as part of a combination like "DEXX/AREX".

I thought that a negative lookbehind might work, i.e.:

foreach (keys %hashstore){ $doc=~s#(?<!/)\b($_)\b#$1/$hashstore{$_}#ig; }

No luck though

It seems to detect "/" as a word boundary. Has anyone encountered this, or does anyone know of a way around it? Many thanks!

Replies are listed 'Best First'.
Re: Getting around "/" as a word boundary
by ikegami (Patriarch) on Aug 12, 2010 at 04:22 UTC
    The problem has nothing to do with word boundaries. The problem is that you're replacing your replacements. Fix:
    my $pat = join '|', keys(%hashstore);
    $doc =~ s!\b($pat)\b!$1/$hashstore{uc($1)}!ig;

    You're wrong about the negative lookbehind not working.

    my %hashstore = ( "DEXX" => "AREX", "AREX" => "CUBE" );

    my $doc1 = "DEXX";
    my $doc2 = "DEXX";
    my $doc3 = "DEXX";

    #foreach (keys %hashstore){
    foreach ("DEXX", "AREX"){  # Make sure we get them in the worse order.
        $doc1=~s#\b($_)\b#$1/$hashstore{uc($_)}#ig;
    }

    #foreach (keys %hashstore){
    foreach ("DEXX", "AREX"){  # Make sure we get them in the worse order.
        $doc2=~s#(?<!/)\b($_)\b#$1/$hashstore{uc($_)}#ig;
    }

    my $pat = join '|', keys(%hashstore);
    $doc3 =~ s!\b($pat)\b!$1/$hashstore{uc($1)}!ig;

    print("$doc1\n");  # DEXX/AREX/CUBE
    print("$doc2\n");  # DEXX/AREX
    print("$doc3\n");  # DEXX/AREX

    Note the addition of uc(). Without it, you'd match stuff you wouldn't find in the hash because of /i.
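To make the uc() point concrete, here is a minimal sketch (not part of the original reply). The mixed-case input string and the variable names $with_uc and @missing are made up for illustration, and the hash keys are assumed to be upper case as in the question.

```perl
use strict;
use warnings;

my %hashstore = ( DEXX => 'AREX', AREX => 'CUBE' );
my $pat = join '|', map quotemeta, keys %hashstore;

# Hypothetical input with mixed case; /i lets 'dexx' and 'Arex' match.
my $doc = 'dexx and Arex and CUBE';

# With uc(): the matched text is normalised before the hash lookup.
(my $with_uc = $doc) =~ s!\b($pat)\b!$1/$hashstore{uc($1)}!ig;

# Without uc(): the raw matches are not keys of the hash at all.
my @missing = grep { !exists $hashstore{$_} } $doc =~ /\b($pat)\b/ig;

print "$with_uc\n";
print "matched but not in hash: @missing\n";
```

Looking up $hashstore{$1} directly for those raw matches would yield undef and an "uninitialized value" warning in the replacement.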

      The use of | (ordered alternation) in the regex introduces a subtlety: Perl tries the alternatives from left to right and takes the first one that matches, regardless of match length. Since the order of strings returned from keys is essentially random, this may not be what you want. The use of \b word boundaries and look-behind avoids the problem in the particular example given in Re: Getting around "/" as a word boundary, but these may not always be available.

      Usually, the longest match is needed. Sorting the keys of the replacement hash (in default order) and then reversing that order puts each key ahead of any key that is a prefix of it, so the longest match is tried first: 'ABC', 'ABCD', 'ABCDE' (in any order) becomes 'ABCDE', 'ABCD', 'ABC'. E.g. (upper/lower case issues ignored):

      >perl -wMstrict -le
      "my %replace = (
         DEXX => 'AREX', AREX => 'CUBE',
         ABC => 'VWX', ABCD => 'VWXY', ABCDE => 'VWXYZ',
         );

       my $find = join '|', map quotemeta, keys %replace;
       $find = qr{ $find }xms;
       print qq{find regex: $find};

       my $s = 'DEXX AREX CUBE ABC ABCD ABCDE';
       print qq{before: '$s'};
       (my $t = $s) =~ s{ ($find) }{$1/$replace{$1}}xmsg;
       print qq{after: '$t'};
       print '';

       my $longest = join '|', map quotemeta, reverse sort keys %replace;
       $longest = qr{ $longest }xms;
       print qq{find regex (longest match): $longest};

       print qq{before: '$s'};
       ($t = $s) =~ s{ ($longest) }{$1/$replace{$1}}xmsg;
       print qq{after: '$t'};
      "
      find regex: (?msx-i: DEXX|ABC|ABCD|ABCDE|AREX )
      before: 'DEXX AREX CUBE ABC ABCD ABCDE'
      after: 'DEXX/AREX AREX/CUBE CUBE ABC/VWX ABC/VWXD ABC/VWXDE'

      find regex (longest match): (?msx-i: DEXX|AREX|ABCDE|ABCD|ABC )
      before: 'DEXX AREX CUBE ABC ABCD ABCDE'
      after: 'DEXX/AREX AREX/CUBE CUBE ABC/VWX ABCD/VWXY ABCDE/VWXYZ'

      Updates:

      1. The example given above implies misleadingly that use of a properly ordered alternation alone is sufficient, that  \b word boundaries are not needed in the case given in the OP. Not (necessarily) so:
        >perl -wMstrict -le
        "my %replace = (
           ABC => 'XXX', ABCD => 'YYYY', ABCDE => 'ZZZZZ',
           );

         my $find = join '|', map quotemeta, reverse sort keys %replace;
         $find = qr{ $find }xms;
         print qq{find regex: $find};

         my $s = 'ABC ABCD xxABCDxx ABCDE';
         print qq{before: '$s'};
         (my $t = $s) =~ s{ ($find) }{$replace{$1}}xmsg;
         print qq{sans \\b: '$t'};
         print '';

         print qq{before: '$s'};
         ($t = $s) =~ s{ \b ($find) \b }{$replace{$1}}xmsg;
         print qq{with \\b: '$t'};
        "
        find regex: (?msx-i: ABCDE|ABCD|ABC )
        before: 'ABC ABCD xxABCDxx ABCDE'
        sans \b: 'XXX YYYY xxYYYYxx ZZZZZ'

        before: 'ABC ABCD xxABCDxx ABCDE'
        with \b: 'XXX YYYY xxABCDxx ZZZZZ'
      2. See discussion of alternation in perlre and perlretut.
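Pulling the pieces of this thread together, the following is a rough sketch of a single-pass helper that combines quotemeta, longest-first ordering, \b word boundaries and the uc() lookup. The sub name annotate and the sample input are hypothetical, and the hash keys are assumed to be upper case.

```perl
use strict;
use warnings;

# Hypothetical helper: append "/replacement" to each whole-word key in one pass.
sub annotate {
    my ($text, $map) = @_;

    # Longest-first: a reverse string sort puts 'ABCD' ahead of its prefix 'ABC'.
    my $alt = join '|', map quotemeta, reverse sort keys %$map;
    my $re  = qr{ \b ($alt) \b }xmsi;

    # Keys are assumed to be upper case, so normalise the match before lookup.
    $text =~ s{$re}{$1/$map->{uc($1)}}g;
    return $text;
}

my %replace = ( DEXX => 'AREX', AREX => 'CUBE', ABC => 'VWX', ABCD => 'VWXY' );
my $out = annotate('dexx AREX xxABCDxx ABCD ABC', \%replace);
print "$out\n";
```

Because all keys are tried in one substitution, text inserted by a replacement is never rescanned, which is what prevents "DEXX/AREX" from turning into "DEXX/AREX/CUBE".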

      Hi ikegami, I am a Perl beginner and I don't fully understand the following regex. Can you explain the (?<!/) part in detail? Thanks in advance.

      $doc2=~s#(?<!/)\b($_)\b#$1/$hashstore{uc($_)}#ig;

        This is a "zero-width, negative look-behind assertion". (?<!/) is true at the point at which it occurs in the regex if that position in the string is not immediately preceded by a '/' (forward slash) character. It consumes no characters itself.

        See  (?<!pattern) in the Extended Patterns section (in the Look-Around Assertions subsection) of perlre.

        Update: Added "zero-width" to description.
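A small illustration of the assertion, as a sketch rather than anything from the replies above: the test strings and the @results array are made up, and m{...} delimiters are used so that the '/' needs no escaping.

```perl
use strict;
use warnings;

my @results;
for my $s ('AREX', 'DEXX/AREX', 'DEXX AREX', '/AREX') {
    # (?<!/) consumes nothing; it only checks the character just before 'AREX'.
    my $matched = $s =~ m{(?<!/)\bAREX\b} ? 'match' : 'no match';
    push @results, "$s: $matched";
    print "$s: $matched\n";
}
```

Only the strings in which "AREX" directly follows a slash are rejected; a preceding space or the start of the string is fine.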

        "(?<!/)" means "not immediately preceded by '/'"
