Serge314 has asked for the wisdom of the Perl Monks concerning the following question:

The regex
'ab' =~ /((\w+)(?{print defined $2 ? "\$2=$2\n" : "\$2 not defined\n"})){2}/;
outputs:
$2=ab
$2 not defined
$2=b
Why is $2 not defined? I think the regex should print $2=a here. Is this a bug?

Re: A bug in Perl regex(?)
by Corion (Patriarch) on Feb 18, 2011 at 15:32 UTC

    See (your?) post to perl5-porters with the same report. What problems do you have with the replies you got there?

    Update: Removed German-specific Google query parameter

      Thanks, it's my bug report. Here's my answer to Eric Brine:

      Let's rewrite the regex

      'ab' =~ /((\w+)(?{print defined $2 ? "\$2=$2\n" : "\$2 not defined\n"})){2}/;

      as

      ((\w+)(?{print...}))((\w+)(?{print...}))

      \w{2} is equivalent to \w\w, right? But we assume that the second copy of the regex also sets the same $1 and $2 (not $3 and $4). The current position in the regex is marked with |.

      1. The first (\w+) captures all the text:
      ((\w+) | (?{print...}))((\w+)(?{print...}))
      $2 receives the value 'ab', and the eval prints $2=ab.

      2. Then we enter the second copy of (\w+):
      ((\w+)(?{print...}))(( | \w+)(?{print...}))
      $2 (and also $+, $^N, \2) becomes undefined.

      3. We see that \w does not match, so we backtrack:
      ((\w+ | )(?{print...}))((\w+)(?{print...}))
      We re-enter the first copy of (\w+) from right to left, and $2 becomes undefined again.

      4. (\w+) captures the letter a:
      ((\w+) | (?{print...}))((\w+)(?{print...}))
      $2 should receive the value a, but in the current version of Perl $2 is
      undefined. Why? Perhaps the two undefined values are stored for $2 as on a stack,
      then only the last one is removed from the stack, and $2 is left undefined again?
      Here the eval should print $2=a.

      5. The second copy of (\w+) captures the letter b:
      ((\w+)(?{print...}))((\w+) | (?{print...}))
      The eval prints $2=b. Match successful.

      Do you see any mistake in this reasoning?
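
The five steps above can be checked with a minimal sketch that records what each run of the code block sees instead of printing it. The package array @trace is a name introduced here purely for illustration. The middle entry is the one under dispute and may differ between Perl versions, so nothing is claimed about it here.

```perl
use strict;
use warnings;

# Hypothetical helper variable: a package array, so the embedded
# code block can reach it regardless of how closures are handled.
our @trace;

# Same pattern as in the question, but each run of the code block
# appends a description of $2 to @trace instead of printing it.
my $matched = 'ab' =~ /((\w+)(?{ push @trace, defined $2 ? "\$2=$2" : "\$2 not defined" })){2}/;

# One line per run of the code block, in the order the engine ran them.
print "$_\n" for @trace;
print $matched ? "match\n" : "no match\n";
```

The first and last entries correspond to steps 1 and 5. The entry in between corresponds to step 4, which is the point where the thread disagrees about what $2 should hold.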

        Sorry for my poor English.
        After my previous post I thought it over once again, and I now think that, intuitively, $2=undefined is incorrect and $2=a is correct.

        After that I received an email from the regex guru Jeffrey Friedl (regex.info):
        ---

        Hi Serge,
        I've been thinking about this for a while, and as far as I can tell it does seem
        to be a bug. By definition, $2 must be defined before the (?{...}) can run.
        It's probably a problem with how it backtracks. I'd suggest filing a bug report.

        ---
        Splitting the regex into
        ((\w+)(?{print...}))((\w+)(?{print...}))
        is wrong; the regex is not really split.
        After (\w+) captures the whole string:
        (\w+)) | {2}
        we see that the second repetition of \w does not match. We backtrack and enter the second parentheses going from right to left:
        (\w | )+
        In this case the regex engine (I think) sets $2 to undefined, but why? Intuitively, $2 should be set to undefined only after we leave the opening second parenthesis going from right to left.

Re: A bug in Perl regex(?)
by kennethk (Abbot) on Feb 18, 2011 at 15:51 UTC
    This might be somewhat buggy behavior, but here is how I interpret the events. Because of your {2}, the pattern you are ultimately trying to match is /((\w+))((\w+))/. However, as YAPE::Regex::Explain points out,
    NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in [$2]
    The first time through, \w+ grabs both letters and stores them in $2, and prints ab as expected. It then sees the repetition at the end and shifts the pointer for the second buffer to the second repetition (what would be $4 in my unrolled version). This means that when the first attempt fails and you grab a on your second attempt, $2 doesn't point there anymore: it points to the second buffer in the second iteration.
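
To make the difference between the unrolled and the quantified form concrete, here is a small sketch without any code blocks. It only shows what the capture buffers hold after a completed match, not what they hold during backtracking, and the variable names are chosen just for this illustration.

```perl
use strict;
use warnings;

# Unrolled form: four separate groups, so both pieces stay visible.
my @unrolled = 'ab' =~ /((\w+))((\w+))/;

# Quantified form: groups 1 and 2 are reused on every repetition,
# so only the last repetition is left in them after the match.
my @quantified = 'ab' =~ /((\w+)){2}/;

print "unrolled:   @unrolled\n";      # a a b b
print "quantified: @quantified\n";    # b b
```

In the unrolled version the letter a is still available in $1 and $2 after the match, while in the quantified version it has been overwritten by the second repetition.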

    I think (though this is subject to argument) that the correct behavior should not be what you claim, but should be

    $2 not defined
    $2 not defined
    $2=b
    since the final $2 buffer is not populated until your second iteration. In the end, it just emphasizes perlretut's warning:
    Be warned that this feature is considered experimental, and may be changed without notice.
        Please expound on why you believe this should be the output. To my understanding, the regular expression specification defines output but not method. If there is a specification or archived developer discussion I am unaware of, I would appreciate the citation. Otherwise, I do not see a compelling argument for "it should be the last thing matched by those 'physical' parentheses" over my proposal.