http://qs1969.pair.com?node_id=1181544

skkeni04 has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I have a doubt which might seem basic but here it is:

my $a = "This is Perl"; $a =~/^(.+)(e|r)(.*)$/;

I get $1= This is Pe; $2= r; $3= l;

According to me, it should be "$1= This is Perl". $2 and $3 should be null. So my doubt is how are the capture brackets evaluated?

Re: Basic Regular expression
by choroba (Cardinal) on Feb 09, 2017 at 16:30 UTC
    $2 can't be null, because it must be either e or r to make the match successful.

    Compare with

    $a =~ /^(.+)([er]?)(.*)$/;
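To see the difference concretely, here is a minimal sketch that runs both patterns against the OP's string. The variable names ($str, @mandatory, @optional) are illustrative only and do not come from the thread.

```perl
use strict;
use warnings;

my $str = "This is Perl";

# Mandatory group: (e|r) must consume exactly one character,
# so the greedy .+ has to give some characters back.
my @mandatory = $str =~ /^(.+)(e|r)(.*)$/;

# Optional class: [er]? is allowed to match nothing,
# so the greedy .+ can keep the whole string.
my @optional = $str =~ /^(.+)([er]?)(.*)$/;

print "mandatory: 1={$mandatory[0]} 2={$mandatory[1]} 3={$mandatory[2]}\n";
print "optional:  1={$optional[0]} 2={$optional[1]} 3={$optional[2]}\n";
# mandatory: 1={This is Pe} 2={r} 3={l}
# optional:  1={This is Perl} 2={} 3={}
```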

      Thanks, Got it.
Re: Basic Regular expression
by Marshall (Canon) on Feb 09, 2017 at 16:39 UTC
    From http://perldoc.perl.org/perlre.html:
    By default, a quantified subpattern is "greedy", that is, it will match as many times as possible (given a particular starting location) while still allowing the rest of the pattern to match. If you want it to match the minimum number of times possible, follow the quantifier with a "?". Note that the meanings don't change, just the "greediness".
    In other words, the greedy (.+) behind $1 takes as much as it can, but leaves just enough room for the other terms to match.
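As an illustration of the "?" modifier described in that quote, the sketch below compares the OP's greedy pattern with a non-greedy variant. The names @greedy and @lazy are just labels chosen for this example.

```perl
use strict;
use warnings;

my $str = "This is Perl";

# Greedy: .+ takes as much as it can and gives back only what is needed.
my @greedy = $str =~ /^(.+)(e|r)(.*)$/;

# Non-greedy: .+? takes as little as it can and grows only when forced to.
my @lazy = $str =~ /^(.+?)(e|r)(.*)$/;

print "greedy: 1={$greedy[0]} 2={$greedy[1]} 3={$greedy[2]}\n";
print "lazy:   1={$lazy[0]} 2={$lazy[1]} 3={$lazy[2]}\n";
# greedy: 1={This is Pe} 2={r} 3={l}
# lazy:   1={This is P} 2={e} 3={rl}
```

Both patterns match the same string; only the way the characters are divided among the groups changes.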

    Update, consider:

    my $x = "This is Perl";
    $x =~ /^((.+)(e|r)(.*))$/;
    print "1={$1} 2={$2} 3={$3} 4={$4}\n";
    # 1={This is Perl} 2={This is Pe} 3={r} 4={l}

    $x = "This is Perl, nice Perl";
    $x =~ /^((.+)(e|r)(.*))$/;
    print "1={$1} 2={$2} 3={$3} 4={$4}\n";
    # 1={This is Perl, nice Perl} 2={This is Perl, nice Pe} 3={r} 4={l}
    A small update: I changed $a to $x in the above code. In Perl, $a and $b are special variables used by, among other things, sort comparison blocks. Normal user code should not use them outside those special cases, so something like $x and $y is better. Using $a wouldn't matter in the code above, but I changed it anyway to point out a bad habit that can lead to problems in longer programs. It is something to watch out for if you usually code in languages that give no special meaning to a or b.
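For readers who have not met the special role of $a and $b yet, here is a minimal sketch of how sort uses them. The array names are made up for the example.

```perl
use strict;
use warnings;

my @numbers = (10, 9, 100, 1);

# sort sets the package variables $a and $b for every comparison.
# They need no declaration, even under "use strict".
my @sorted = sort { $a <=> $b } @numbers;

print "@sorted\n";   # prints "1 9 10 100"
```

A lexical `my $a` declared in an enclosing scope would hide the package variable that sort sets, which is the kind of problem the advice above is meant to avoid.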
      Thanks, got it!
Re: Basic Regular expression
by hippo (Bishop) on Feb 09, 2017 at 16:31 UTC

    $2 cannot be null, it must be either "e" or "r" or else the entire regex would fail to match. Hopefully the rest becomes obvious once you understand this part?

      Yes, it did. Thanks!
Re: Basic Regular expression
by AnomalousMonk (Archbishop) on Feb 09, 2017 at 18:58 UTC

    If you're dealing only with regex operators supported by Perl version 5.6 and before (as you are in the OPed example), the YAPE::Regex::Explain module can sometimes be helpful:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use YAPE::Regex::Explain;
     ;;
     print YAPE::Regex::Explain->new('^(.+)(e|r)(.*)$')->explain;
    "
    The regular expression:

    (?-imsx:^(.+)(e|r)(.*)$)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      ^                        the beginning of the string
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        .+                       any character except \n (1 or more times
                                 (matching the most amount possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
      (                        group and capture to \2:
    ----------------------------------------------------------------------
        e                        'e'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        r                        'r'
    ----------------------------------------------------------------------
      )                        end of \2
    ----------------------------------------------------------------------
      (                        group and capture to \3:
    ----------------------------------------------------------------------
        .*                       any character except \n (0 or more times
                                 (matching the most amount possible))
    ----------------------------------------------------------------------
      )                        end of \3
    ----------------------------------------------------------------------
      $                        before an optional \n, and the end of the
                               string
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------



Re: Basic Regular expression
by Laurent_R (Canon) on Feb 09, 2017 at 18:10 UTC
    You've been given good answers already, but let me just add some details on the way the regex engine processes this string.
    According to me, it should be "$1= This is Perl".
    In fact, that's what happens initially: the regex engine sees (.+) and grabs the whole string, i.e. "This is Perl". But then it sees (e|r), so for the whole regex to succeed it has to backtrack and give back "l" and then "r", so that (e|r) can match. Note that this would happen even without capturing parentheses: the point is not so much that the engine is trying to populate $2, but that (e|r) has to match something for the whole regex to succeed.

    Once it has matched the "r" with the second capture, the last part of the regex, (.*)$ can match the "l".
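The point that backtracking does not depend on capturing can be checked with a small sketch that swaps the middle group for a non-capturing one. The names $head and $tail are chosen for this example only.

```perl
use strict;
use warnings;

my $str = "This is Perl";

# (?:e|r) groups without capturing, yet it must still match one character,
# so the greedy .+ is forced to back off exactly as before.
my ($head, $tail) = $str =~ /^(.+)(?:e|r)(.*)$/;

print "head={$head} tail={$tail}\n";
# head={This is Pe} tail={l}
```

The first group still ends at "Pe", which shows that the split is driven by what the pattern needs in order to match, not by the presence of $2.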

Re: Basic Regular expression
by Corion (Patriarch) on Feb 09, 2017 at 16:30 UTC

    In what situation can the second parenthesis be empty and still produce an overall match?

Re: Basic Regular expression
by NetWallah (Canon) on Feb 09, 2017 at 17:29 UTC
    You can achieve your desired output by using this re:
    $a =~/^(.+)(e|r)?(.*)$/;
    Update: Just noticed - this is almost the same as choroba's suggestion.
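One small difference between the two suggestions is what ends up in the second capture. The sketch below prints it for both; the array names are illustrative.

```perl
use strict;
use warnings;

my $str = "This is Perl";

# (e|r)? : the whole group is optional; when it is skipped, it captures nothing.
my @group_optional = $str =~ /^(.+)(e|r)?(.*)$/;

# ([er]?) : the group always takes part, but may capture an empty string.
my @class_optional = $str =~ /^(.+)([er]?)(.*)$/;

printf "(e|r)?  : 1={%s} 2 is %s\n", $group_optional[0],
    defined $group_optional[1] ? "'$group_optional[1]'" : "undef";
printf "([er]?) : 1={%s} 2 is %s\n", $class_optional[0],
    defined $class_optional[1] ? "'$class_optional[1]'" : "undef";
# (e|r)?  : 1={This is Perl} 2 is undef
# ([er]?) : 1={This is Perl} 2 is ''
```

Under "use warnings", printing the undefined capture directly would raise an "uninitialized value" warning, so the choice between the two forms can matter in practice.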


Re: Basic Regular expression
by tweetiepooh (Hermit) on Feb 09, 2017 at 16:48 UTC

    The answer you get is what is expected. What do you think the regex reads like?

    Start then capture 1 or more characters up to "e" or "r" captured then capture anything left to end.

    Remember the match is greedy so matches the "r" in the option rather than the "e".

      Start then capture 1 or more characters up to "e" or "r"

      If that were true, the OP would have received

      $1="This is P"; $2="e"; $3="rl";
      instead of
      $1="This is Pe"; $2="r"; $3="l";

      It actually matches until the end of the line, as the OP expects, but it's then forced to backtrack until it finds a position that's followed by e or r.
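For comparison, a pattern that really does read as "capture up to the first e or r" could use a negated character class in place of .+ (this variant is not from the thread, only a sketch):

```perl
use strict;
use warnings;

my $str = "This is Perl";

# [^er]+ cannot cross an "e" or an "r", so the first group has to stop
# just before the first of them and no backtracking from the end is needed.
my @first_er = $str =~ /^([^er]+)(e|r)(.*)$/;

print "1={$first_er[0]} 2={$first_er[1]} 3={$first_er[2]}\n";
# 1={This is P} 2={e} 3={rl}
```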