No bugs, if you ask me. A few correct ways to write the delimiter in single quotes are {-delim => '/\\'}, {-delim => '\/'}, or {-delim => '\\/'}.

I don't know how Regexp::Common interprets this; it probably applies quotemeta or something like that.

The point is that in:

print "P1 has path\n" if ($P1 =~ /$RE{delimited}{ -delim => '\\\/' }/ );

the quoted string becomes

-delim => '\/' ;

In:

print "P1 has path\n" if ($P1 =~ /$RE{delimited}{ -delim => '\/' }/ );

The quoted string becomes:

-delim => '/' ;

This is because the \ is treated as an escape character between //.

It is always tedious in Perl how mistakes like this slip in. For example, take a quick look at swl's example; I think the '\' character is specified twice in:

print $RE{delimited}{-delim => '[\\\/]'};

I even hope that I did not make any mistakes myself right now :P

Re^3: Delimiters in Regexp::Common (updated)
by AnomalousMonk (Archbishop) on May 07, 2018 at 14:46 UTC
    The point is that in:
    print "P1 has path\n" if ($P1 =~ /$RE{delimited}{ -delim => '\\\/' }/ );
    the quoted string becomes
    -delim => '\/' ;
    In:
    print "P1 has path\n" if ($P1 =~ /$RE{delimited}{ -delim => '\/' }/ );
    The quoted string becomes:
    -delim => '/' ;
    This is because the \ is treated as an escape character between //.

    I disagree. In the index expression of an array (positional or associative), the expression is evaluated in scalar context and not in the double-quotish context of a regex into which the array element may happen to be interpolated. So  '\\\/' and  '\/' are evaluated in single-quotish context and become the character sequences  \\/ and  \/ respectively. And because of the way backslashes are interpreted in single-quote context,  '\\\\/' and  '\\\/' are equivalent, and  '\\/' and  '\/' likewise. E.g.:

    c:\@Work\Perl\monks\Veltro>perl -wMstrict -MData::Dump -le
    "my %RE = (
       '\\\\/' => 'BackBackFwd1',
       '\\\/'  => 'BackBackFwd2',
       '\\/'   => 'BackFwd1',
       '\/'    => 'BackFwd2',
       '/'     => 'Fwd',
       );
     dd \%RE;
     ;;
     my $rx = qr{ $RE{'\\\\/'} $RE{'\\\/'} $RE{'\\/'} $RE{'\/'} $RE{'/'} };
     print $rx;
    "
    { "/" => "Fwd", "\\/" => "BackFwd2", "\\\\/" => "BackBackFwd2" }
    (?^: BackBackFwd2 BackBackFwd2 BackFwd2 BackFwd2 Fwd )

    There are a couple of Data::Dump::dd() and hash peculiarities:

    • Why is the key of the value  "BackBackFwd2" in the dd dump represented as  "\\\\/" when it's given as  '\\\/' in the hash definition? This is an artifact of the way dd represents strings only as double-quoted strings, where a single literal backslash can only be written as the  "\\" escape sequence.
    • Why is there no  'BackBackFwd1' value in the hash? Because the  '\\\\/' and  '\\\/' string literals compile to identical character sequences (update: i.e., identical keys), and the second key (with the value 'BackBackFwd2') supersedes the first. (And likewise with 'BackFwd1'.)
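The single-quote rule behind both answers can also be checked without a hash or Data::Dump. The following is only a minimal sketch using core Perl; the array name @literals is made up for illustration. It prints each literal as Perl stores it, together with its length, so the backslashes can be counted.

```perl
use strict;
use warnings;

# The five literals from the example above, written exactly as they
# appear in source code inside single quotes.
my @literals = ( '\\\\/', '\\\/', '\\/', '\/', '/' );

# Print each resulting string with its length. In single quotes only
# \\ and \' are escape sequences; any other backslash stays literal.
for my $s (@literals) {
    printf "%-4s length %d\n", $s, length $s;
}

# Literals that differ in source but hold the same characters.
print "four and three backslashes agree\n" if '\\\\/' eq '\\\/';
print "two and one backslash agree\n"      if '\\/'   eq '\/';
```

The first two literals both come out three characters long and the next two both come out two characters long, which is why each pair collapses into a single hash key.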


    Give a man a fish:  <%-{-{-{-<

      You may want to have another look at this, because the next lines do not compile:

      print "P2 has path\n" if ($P2 =~ /$RE{delimited}{ -delim => '/' }/ );
      print "P2 has path\n" if ($P2 =~ /$RE{delimited}{ -delim => '\\/' }/ );
        ... the next lines do not compile: ...

        I've played around with this some more and I'm coming to the conclusion that this has little or nothing to do with Regexp::Common::delimited and more to do with the use of a regex delimiter character within the regex pattern. The following works as I expect with any of
            '\/'  '\\/'  '\\\/'  '\\\\/'  '\\\\\/'  '\\\\\\/'
        as the  -delim delimiter specification:

        c:\@Work\Perl\monks\Veltro>perl -wMstrict -le
        "use Regexp::Common qw(delimited);
         ;;
         for my $s (qw(
             a/b/c  a\b\c  /a/  \a\  a//b  a\\\\b  //  \\\\
             a/b  a\b  a/b\c  a\b/c  a/  /a  a\  \a  /  \
             )) {
           print qq{'$s' }, $s =~ m{$RE{delimited}{ -delim => '\/' }} ? '' : 'NO ', ' match';
           }
        "
        'a/b/c' match
        'a\b\c' match
        '/a/' match
        '\a\' match
        'a//b' match
        'a\\b' match
        '//' match
        '\\' match
        'a/b' NO match
        'a\b' NO match
        'a/b\c' NO match
        'a\b/c' NO match
        'a/' NO match
        '/a' NO match
        'a\' NO match
        '\a' NO match
        '/' NO match
        '\' NO match
        Both  m: ... : and the balanced  m{ ... } (my personal preference per TheDamian's regex PBPs) yield the same results.

        For a  / ... / delimited match with the code above, the  -delim strings:

        • '\\\/'  '\\\\\/' work as expected;
        • '\\/'  '\\\\/'  '\\\\\\/' fail to compile (Can't find string terminator "'" ...); and
        • '\/' works partially as expected (go figure).
        Again, the lesson seems to be: be wary of the presence of a delimiter character within a regex pattern.

        IIRC from previous regex compilation discussions (and please don't ask me for a citation :), I think what's happening here is that the regex parser looks for the end of a regex using various heuristics as soon as it sees that a regex has opened, and in this case, it sees the forward-slash at the end of the first  '\\/' (or whatever) single-quoted string and sometimes mistakes it for the regex terminal delimiter. The Perl parser looks for single-quoted strings thereafter, and goes off the rails when it sees that a final single-quote is unmatched. Or something like that... Anyway, don't use  // regex delimiters here.

        Update: The "premature regex termination detection" theory is supported if the
            my $rx = qr{ $RE{'\\\\/'} $RE{'\\\/'} $RE{'\\/'} $RE{'\/'} $RE{'/'} };
        regex from Re^3: Delimiters in Regexp::Common (updated) is re-written with  qr/ ... / instead: the "Can't find string terminator "'" anywhere ..." compilation error results.
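The compile-time difference between the two delimiter styles can be observed from inside a running script by compiling each variant with string eval. This is only a sketch under assumptions: %h is a hypothetical stand-in for %RE, and the single-quoted heredocs are there so that the snippet text reaches eval with its backslashes untouched.

```perl
use strict;
use warnings;

# Source text with balanced { } regex delimiters.
my $with_braces = <<'END_SRC';
my %h = ( '\\/' => 'x' );
my $rx = qr{ $h{'\\/'} };
1;
END_SRC

# The same source text, but with / ... / regex delimiters.
my $with_slashes = <<'END_SRC';
my %h = ( '\\/' => 'x' );
my $rx = qr/ $h{'\\/'} /;
1;
END_SRC

# Compile each snippet and record whether compilation succeeded.
my %compiles;
for my $case ( [ 'qr{ ... }', $with_braces ], [ 'qr/ ... /', $with_slashes ] ) {
    my ( $label, $src ) = @$case;
    $compiles{$label} = eval($src) ? 1 : 0;
    print "$label: ", ( $compiles{$label} ? 'compiles' : 'fails to compile' ), "\n";
}
```

In the slash-delimited variant the tokenizer treats the \\ as an escaped pair and then takes the following / as the end of the regex, so the key's closing single quote is left dangling. That fits the "premature termination" reading above, although the sketch only shows that compilation fails, not which internal heuristic is responsible.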


        Give a man a fish:  <%-{-{-{-<

        Regexp::Common returns regexp objects, so one can drop the outer // and it will compile.

        print "P2 has path\n" if ($P2 =~ $RE{delimited}{ -delim => '/' } );
        print "P2 has path\n" if ($P2 =~ $RE{delimited}{ -delim => '\/' } );
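The same idea can be tried without Regexp::Common installed. In the sketch below, $slash_delimited is a hand-written, simplified stand-in for a slash-delimited pattern, not the exact regexp the module generates, and the sample string in $P2 is made up for illustration.

```perl
use strict;
use warnings;

# Simplified stand-in for a '/'-delimited pattern: an opening slash,
# then any mix of ordinary characters and backslash-escaped characters,
# then a closing slash. Written with qr{ } so '/' needs no escaping.
my $slash_delimited = qr{ / (?: [^\\/] | \\. )* / }x;

my $P2 = 'see /usr/local/ for details';    # hypothetical sample input

# The compiled pattern sits directly on the right-hand side of =~,
# so there is no outer / ... / for the '/' inside it to collide with.
print "P2 has path\n" if $P2 =~ $slash_delimited;
```

Because the pattern is compiled once and then used as a value, the parser never has to find a closing delimiter around it at the point of use.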

        Then the escaping becomes a consideration.

        use 5.026;
        use Regexp::Common qw[ delimited ];

        say '\/';
        say $RE{delimited}{ -delim => '\/' };
        say '\\/';
        say $RE{delimited}{ -delim => '\\/' };
        say '\\\/';
        say $RE{delimited}{ -delim => '\\\/' };
        say '\\\\/';
        say $RE{delimited}{ -delim => '\\\\/' };

        produces

        \/ (?:(?|(?:\\)(?:[^\\]*(?:(?:\\\\)[^\\]*)*)(?:\\)|(?:\/)(?:[^\\\/]*(?:\\ +.[^\\\/]*)*)(?:\/))) \/ (?:(?|(?:\\)(?:[^\\]*(?:(?:\\\\)[^\\]*)*)(?:\\)|(?:\/)(?:[^\\\/]*(?:\\ +.[^\\\/]*)*)(?:\/))) \\/ (?:(?|(?:\\)(?:[^\\]*(?:(?:\\\\)[^\\]*)*)(?:\\)|(?:\\)(?:[^\\]*(?:(?:\ +\\\)[^\\]*)*)(?:\\)|(?:\/)(?:[^\\\/]*(?:\\.[^\\\/]*)*)(?:\/))) \\/ (?:(?|(?:\\)(?:[^\\]*(?:(?:\\\\)[^\\]*)*)(?:\\)|(?:\\)(?:[^\\]*(?:(?:\ +\\\)[^\\]*)*)(?:\\)|(?:\/)(?:[^\\\/]*(?:\\.[^\\\/]*)*)(?:\/)))

        It also appears that Regexp::Common does not de-duplicate the character sequence before it builds the regexp, as the regexps become more complicated as the sequences increase in length.
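If the repeated alternatives are unwanted, one workaround is to remove duplicate characters from the delimiter string before handing it over. The helper below, unique_chars, is hypothetical and not part of Regexp::Common; it is a minimal core-Perl sketch of that idea.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Regexp::Common: keep the first
# occurrence of each character and drop any later repeats.
sub unique_chars {
    my ($chars) = @_;
    my %seen;
    return join '', grep { !$seen{$_}++ } split //, $chars;
}

my $delims = '\\\/';                   # backslash, backslash, slash
my $unique = unique_chars($delims);    # backslash, slash

printf "%d characters before, %d after\n", length $delims, length $unique;
```

The cleaned string could then be supplied as the -delim value, on the assumption that the module builds one alternative per character it is given, as the output above suggests.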