rongrw has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I've recently started to use Regexp::Common to write more readable regular expressions. In particular, I'm using the "delimited" feature set to identify path names. However, it seems using backslash "\" as a delimiter won't work. The code example below illustrates the problem.

I'm not sure whether the problem is with the way I'm using Regexp::Common, or whether I've uncovered a bug with the module. Any suggestions would be most appreciated.

use Regexp::Common qw[ delimited ];
my $P1 = '../matrix-ops/matopmul.mk';
my $P2 = 'C:\matrix-ops\matopmul.mk';
print "P1 has path\n" if ($P1 =~ /$RE{delimited}{-delim => '\/'}/ );
print "P2 has path\n" if ($P2 =~ /$RE{delimited}{-delim => '\/'}/ );

Cheers, Ron.

Replies are listed 'Best First'.
Re: Delimiters in Regexp::Common
by swl (Prior) on May 06, 2018 at 10:43 UTC

    Use '\\' instead of '\/'

    print "P2 has path\n" if ($P2 =~ /$RE{delimited}{-delim => '\\'}/ );

      Thank you. Yes, that certainly works for $P2.
      What I really wanted though was a single specification for "-delim =>" that uses back-slash OR forward-slash as delimiters.

      However, your suggestion has sparked an idea. It turns out that {-delim => '\\\/'} does indeed work for both cases. That's great!

      Cheers, Ron.

        Good to hear it works. I would have used a bracketed character class like [\\\/] but it turns out that a sequence of delimiters is treated as individual characters so that's unnecessary. It's in the docs now that I read them, so the rest of the post is partly for reference by my future self.

        print $RE{delimited}{-delim => 'ab'};

        gives

        (?:(?|(?:a)(?:[^\\a]*(?:\\.[^\\a]*)*)(?:a)|(?:b)(?:[^\\b]*(?:\\.[^\\b]*)*)(?:b)))
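Reading the generated pattern above, each delimiter character gets its own alternative of the shape `d [^\\d]* (?: \\. [^\\d]* )* d`. Below is a minimal hand-rolled sketch of that shape, for illustration only. `delimited_re` is a hypothetical helper, not part of Regexp::Common, and it assumes plain alternation is good enough: it ignores the module's other options and the special handling the module appears to apply when backslash is itself a delimiter.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): build one alternative per
# delimiter character, mirroring the shape of the pattern printed above.
sub delimited_re {
    my ($delims) = @_;
    my @alternatives = map {
        my $d     = quotemeta $_;     # escape the delimiter for regex use
        my $plain = "[^\\\\$d]*";     # run of chars that are neither \ nor the delimiter
        "$d$plain(?:\\\\.$plain)*$d"; # open, body with backslash escapes, close
    } split //, $delims;
    my $joined = join '|', @alternatives;
    return qr/(?:$joined)/s;
}

my $quoted = delimited_re(q{"'});

for my $text ( q{say "hi there" now}, q{x 'ok' y}, q{"a\"b"} ) {
    if ( $text =~ $quoted ) {
        print "found: $&\n";
    }
}
```

Each character of the argument is treated as a separate delimiter, which is the behaviour described above for a multi-character `-delim` value.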

        If one uses a character class then the square brackets are treated as delimiters.

        print $RE{delimited}{-delim => '[\\\/]'};

        gives

        (?:(?|(?:\[)(?:[^\\\[]*(?:\\.[^\\\[]*)*)(?:\[)|(?:\\)(?:[^\\]*(?:(?:\\\\)[^\\]*)*)(?:\\)|(?:\\)(?:[^\\]*(?:(?:\\\\)[^\\]*)*)(?:\\)|(?:\/)(?:[^\\\/]*(?:\\.[^\\\/]*)*)(?:\/)|(?:\])(?:[^\\\]]*(?:\\.[^\\\]]*)*)(?:\])))

        ...and a RegExp object appears to be stringified and then treated as a sequence of characters.

        print qr'a';
        print "\n\n";
        print $RE{delimited}{-delim => qr'a'};
        print "\n";

        gives

        (?^:a)

        (?:(?|(?:\()(?:[^\\\(]*(?:\\.[^\\\(]*)*)(?:\()|(?:\?)(?:[^\\\?]*(?:\\.[^\\\?]*)*)(?:\?)|(?:\^)(?:[^\\\^]*(?:\\.[^\\\^]*)*)(?:\^)|(?:\:)(?:[^\\\:]*(?:\\.[^\\\:]*)*)(?:\:)|(?:a)(?:[^\\a]*(?:\\.[^\\a]*)*)(?:a)|(?:\))(?:[^\\\)]*(?:\\.[^\\\)]*)*)(?:\))))
Re: Delimiters in Regexp::Common
by AnomalousMonk (Archbishop) on May 06, 2018 at 16:55 UTC
    ... a bug with the module.

    The behavior you're seeing is due to confusion between the // used to delimit the regex and a / used within the regex. IIRC, Regexp::Common uses the tied hash %RE to support things like -delim and -keep and so forth, and apparently the delimiter confusion is passed over silently as a result. "Properly" delimited regexes don't have this problem:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Regexp::Common qw[ delimited ];
     my $P1 = '../matrix-ops/matopmul.mk';
     my $P2 = 'C:\matrix-ops\matopmul.mk';
     print \"P1 has path\n\" if ($P1 =~ /$RE{delimited}{-delim => '\/'}/ );
     print \"P2 has path\n\" if ($P2 =~ /$RE{delimited}{-delim => '\/'}/ );
    "
    P1 has path

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use Regexp::Common qw[ delimited ];
     my $P1 = '../matrix-ops/matopmul.mk';
     my $P2 = 'C:\matrix-ops\matopmul.mk';
     print \"P1 has path\n\" if ($P1 =~ m{$RE{delimited}{-delim => '\/'}});
     print \"P2 has path\n\" if ($P2 =~ m{$RE{delimited}{-delim => '\/'}});
    "
    P1 has path
    P2 has path

    What I have for $Regexp::Common::VERSION and $Regexp::Common::delimited::VERSION are 2011121001 and 2010010201, respectively, so what I have installed is a bit old. Same results under both ActiveState 5.8.9 and Strawberry 5.14.4.1.

    I don't know if this behavior constitutes a bug in the module or not, but in regexes in general, if a character that's used to delimit the regex appears unescaped within the regex, that's a problem:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "print 'match' if '/' =~ ///;
    "
    syntax error at -e line 1, near "/;"
    Execution of -e aborted due to compilation errors.

    c:\@Work\Perl\monks>perl -wMstrict -le
    "print 'match' if '/' =~ /\//;
    "
    match
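The same point as a script rather than a one-liner: a small sketch showing that a literal slash can either be escaped inside `/.../` or left bare once different regex delimiters are chosen. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $path = 'matrix-ops/matopmul.mk';

# Slash escaped because / also delimits the regex.
my $escaped = ( $path =~ /\// ) ? 1 : 0;

# Alternative delimiters: the slash needs no escaping.
my $braces = ( $path =~ m{/} ) ? 1 : 0;
my $bangs  = ( $path =~ m!/! ) ? 1 : 0;

print "escaped=$escaped braces=$braces bangs=$bangs\n";
```

All three forms match the same single character; the only difference is how much escaping the reader has to decode.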

    (And BTW: I'm not sure how the presence of a delimited sequence or subsequence signifies a "path"; there may be some semantic confusion here.)

    Update 1: If you're interested in more readable regexes, do yourself a huge favor and investigate the  /x regex modifier.

    Update 2: Fixed small, cosmetic-only formatting glitch in first two code examples.
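On the /x suggestion in Update 1, here is a minimal sketch of what a commented pattern can look like. The pattern itself is an illustrative assumption (a drive letter followed by backslash-separated components), not something taken from the original post or from Regexp::Common.

```perl
use strict;
use warnings;

# Under /x, whitespace in the pattern is ignored and # starts a comment.
my $windows_path = qr{
    \A
    ( [A-Za-z] ) :          # drive letter followed by a colon
    ( (?: \\ [^\\]+ )+ )    # one or more backslash-prefixed components
    \z
}x;

my $P2 = 'C:\matrix-ops\matopmul.mk';

if ( my ( $drive, $rest ) = $P2 =~ $windows_path ) {
    print "drive=$drive rest=$rest\n";
}
```

Using `qr{...}` here also sidesteps the delimiter question, since neither `/` nor `\` has to double as a regex delimiter.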


    Give a man a fish:  <%-{-{-{-<

      No bugs if you ask me. A couple of examples of the correct way to use single quotes are {-delim => '/\\'}, {-delim => '\/'} or {-delim => '\\/'}.

      How the Regex module interprets this I don't know, probably a quotemeta or something like that.

      The point is that in:

      print "P1 has path\n" if ($P1 =~ /$RE{delimited}{ -delim => '\\\/' }/ );

      the quoted string becomes

      -delim => '\/' ;

      In:

      print "P1 has path\n" if ($P1 =~ /$RE{delimited}{ -delim => '\/' }/ );

      The quoted string becomes:

      -delim => '/' ;

      This is because the \ is treated as an escape character between //.

      It is always tedious in Perl how mistakes like this slip in. For example, take a quick look at swl's example; I think that the '\' character is specified twice in:

      print $RE{delimited}{-delim => '[\\\/]'};

      I even hope that I did not make any mistakes myself right now :P

        The point is that in:
        print "P1 has path\n" if ($P1 =~ /$RE{delimited}{ -delim => '\\\/' }/ );
        the quoted string becomes
        -delim => '\/' ;
        In:
        print "P1 has path\n" if ($P1 =~ /$RE{delimited}{ -delim => '\/' }/ );
        The quoted string becomes:
        -delim => '/' ;
        This is because the \ is treated as an escape character between //.

        I disagree. In the index expression of an array (positional or associative), the expression is evaluated in scalar context and not in the double-quotish context of a regex into which the array element may happen to be interpolated. So  '\\\/' and  '\/' are evaluated in single-quotish context and become the character sequences  \\/ and  \/ respectively. And because of the way backslashes are interpreted in single-quote context,  '\\\\/' and  '\\\/' are equivalent, and  '\\/' and  '\/' likewise. E.g.:

        c:\@Work\Perl\monks\Veltro>perl -wMstrict -MData::Dump -le
        "my %RE = (
           '\\\\/' => 'BackBackFwd1',
           '\\\/' => 'BackBackFwd2',
           '\\/' => 'BackFwd1',
           '\/' => 'BackFwd2',
           '/' => 'Fwd',
           );
         dd \%RE;
         ;;
         my $rx = qr{ $RE{'\\\\/'} $RE{'\\\/'} $RE{'\\/'} $RE{'\/'} $RE{'/'} };
         print $rx;
        "
        { "/" => "Fwd", "\\/" => "BackFwd2", "\\\\/" => "BackBackFwd2" }
        (?^: BackBackFwd2 BackBackFwd2 BackFwd2 BackFwd2 Fwd )

        There are a couple of Data::Dump::dd() and hash peculiarities:

        • Why is the key of the value  "BackBackFwd2" in the dd dump represented as  "\\\\/" when it's given as  '\\\/' in the hash definition? This is an artifact of the way dd represents strings only as double-quoted strings, so a single backslash can only be literally defined as the  "\\" escape sequence.
        • Why is there no  'BackBackFwd1' value in the hash? Because the  '\\\\/' and  '\\\/' string literals compile to identical character sequences (update: i.e., identical keys), and the second key (with the value 'BackBackFwd2') supersedes the first. (And likewise with 'BackFwd1'.)
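The two points above can be checked without Data::Dump. This is a small sketch using only core Perl; the variable names are illustrative.

```perl
use strict;
use warnings;

# In single-quoted literals only \\ and \' are escapes, so \/ keeps its backslash.
my @literals = ( '\\\\/', '\\\/', '\\/', '\/', '/' );
printf "%d chars: %s\n", length($_), $_ for @literals;

# Literals that compile to the same string are the same hash key;
# the later pair overwrites the earlier one.
my %seen = (
    '\\\\/' => 'first',
    '\\\/'  => 'second',
);
my $key_count = keys %seen;
my $value     = $seen{'\\\/'};
print "$key_count key(s), value: $value\n";
```

The first two literals are three characters long, the next two are two characters long, and the hash ends up with a single key holding 'second'.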


        Give a man a fish:  <%-{-{-{-<

Re: Delimiters in Regexp::Common
by rongrw (Acolyte) on May 09, 2018 at 14:51 UTC

    I would like to sincerely thank everyone who added their comments on this topic. A lot of very in-depth detail about regexes was posted - thank you.

    Initially I thought the following Regexp::Common expression was a great solution,

    $P2 =~ /$RE{delimited}{-delim => '\\\/'}/
    

    But then AnomalousMonk posted this expression,

    $P2 =~ m{$RE{delimited}{-delim => '\/'}}
    
    It's brilliant, and definitely my preferred choice because it lists the actual characters that are used as delimiters. The lesson I've learnt here is that if the delimiter character(s) are the same as those enclosing the regex, then choose a different set of characters to enclose the regex.

    Once again, thank you for all the helpful posts!
    - Ron.

      The lesson ... if the delimiter character(s) are the same as those enclosing the regex ...

      I think that the lesson should be that if a regex pattern contains any character that is the same as the regex delimiters, care should be exercised. Usually, it's enough to escape such embedded characters in the normal way, but in this particular case, any escapology does not take place within the normal context of regex interpolation, and that leads to some unexpected quirky behavior.


      Give a man a fish:  <%-{-{-{-<