Re: A bug in Perl regex(?)
by Corion (Patriarch) on Feb 18, 2011 at 15:32 UTC
See (your?) post to perl5-porters with the same report. What problems do you have with the replies you got there?
Update: Removed German-specific Google query parameter
| [reply] |
Thanks, that is my bug report. Here is my answer to Eric Brine.
Let's write the regex
'ab' =~ /((\w+)(?{print defined $2 ? "\$2=$2\n" : "\$2 not defined\n"})){2}/;
as
((\w+)(?{print...}))((\w+)(?{print...}))
\w{2} is equivalent to \w\w, right? The only difference is that we assume the second copy of the
group sets the same $1 and $2 (not $3 and $4). The current position in the regex is
marked with |.
1. First (\w+) captures all the text:
((\w+) | (?{print...}))((\w+)(?{print...}))
$2 receives the value 'ab', eval prints $2=ab.
2. Then we enter the second copy of (\w+):
((\w+)(?{print...}))(( | \w+)(?{print...}))
$2 (and also $+, $^N, \2) becomes undefined.
3. \w does not match here, so we backtrack:
((\w+ | )(?{print...}))((\w+)(?{print...}))
We re-enter the first copy of (\w+) from right to left, and $2 becomes undefined again.
4. (\w+) captures the letter a:
((\w+) | (?{print...}))((\w+)(?{print...}))
$2 should receive the value a, but in the current version of Perl it is
undefined. Why? Perhaps the two undefined values are stored for $2 as on a stack,
only the last one is popped, and so $2 is still undefined?
Here the eval should print $2=a.
5. The second copy of (\w+) captures the letter b:
((\w+)(?{print...}))((\w+) | (?{print...}))
The eval prints $2=b. The match succeeds.
Do you see any mistake in this reasoning?
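For anyone who wants to check the walk-through on their own perl, here is a minimal sketch that records what $2 holds each time the code block runs, instead of printing from inside the regex. The names @seen and trace_second_group are made up for this illustration and are not part of the original report. No expected output is given for the middle entry, since that is exactly the value under dispute and it may differ between Perl versions.

```perl
use strict;
use warnings;

# Hypothetical log of $2 at each (?{...}) call. A package variable is used
# because older Perls handled lexicals inside code blocks unreliably.
our @seen;

# Hypothetical helper: run the pattern from the report against $text and
# return (matched?, array ref with the value of $2 at each code-block call).
sub trace_second_group {
    my ($text) = @_;
    @seen = ();
    my $matched = $text =~ /((\w+)(?{ push @seen, $2 })){2}/;
    return ($matched ? 1 : 0, [@seen]);
}

my ($ok, $log) = trace_second_group('ab');
print "matched: $ok\n";
for my $value (@$log) {
    print defined $value ? "\$2=$value\n" : "\$2 not defined\n";
}
```

The first and last entries should correspond to steps 1 and 5 above; the second entry is the one to compare against step 4.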
| [reply] |
Sorry for my poor English.
After my previous post I thought about it again, and I now think that, intuitively, $2=undefined is incorrect and $2=a is correct.
After that I received an email from the regex guru Jeffrey Friedl (regex.info):
---
Hi Serge,
I've been thinking about this for a while, and as far as I can tell it does seem
to be a bug. By definition, $2 must be defined before the (?{...}) can run.
It's probably a problem with how it backtracks. I'd suggest filing a bug report..
---
Splitting the regex as
((\w+)(?{print...}))((\w+)(?{print...}))
is wrong; the engine does not really split the regex.
After (\w+) captures the whole string:
(\w+)) | {2}
we see that the second repetition of \w does not match. We backtrack and enter the second pair of parentheses going from right to left:
(\w | )+
At this point the regex engine (I think) sets $2 to undefined, but why? Intuitively, $2 should be set to undefined only after we leave the opening second parenthesis going from right to left.
| [reply] |
Re: A bug in Perl regex(?)
by kennethk (Abbot) on Feb 18, 2011 at 15:51 UTC
This might be somewhat buggy behavior, but here is how I am interpreting the events. Because of your {2}, the pattern you are ultimately trying to match is /((\w+))((\w+))/. However, as YAPE::Regex::Explain points out,
NOTE: because you are using a quantifier on this capture, only the LAST repetition of the captured pattern will be stored in [$2]
On the first time through, \w+ grabs both letters, stores them in $2, and prints ab as expected. It then sees the repetition at the end and shifts the pointer for the second buffer to the second repetition (what would be $4 in my unrolled version). This means that when the first attempt fails and you grab a on your second attempt, $2 doesn't point there anymore; it points to the second buffer in the second iteration.
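To make the difference between the two forms concrete, here is a small sketch comparing the final captures of the unrolled pattern with those of the quantified one, with the code blocks left out. The helper names unrolled_captures and quantified_captures are invented for this example. It only shows what the buffers hold after a successful match, not what they hold while the engine is backtracking.

```perl
use strict;
use warnings;

# Unrolled form: four separate groups, so each repetition keeps its own buffer.
sub unrolled_captures {
    my ($text) = @_;
    return $text =~ /((\w+))((\w+))/ ? ($1, $2, $3, $4) : ();
}

# Quantified form: the same two groups are reused on every repetition,
# so only the last repetition is left in $1 and $2 after the match.
sub quantified_captures {
    my ($text) = @_;
    return $text =~ /((\w+)){2}/ ? ($1, $2) : ();
}

print join(',', unrolled_captures('ab')), "\n";    # prints a,a,b,b
print join(',', quantified_captures('ab')), "\n";  # prints b,b
```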
I think (though this is subject to argument) that the correct behavior should not be what you claim, but should be
$2 not defined
$2 not defined
$2=b
since the final $2 buffer is not populated until your second iteration. In the end, it just goes to emphasize perlretut's warning: Be warned that this feature is considered experimental, and may be changed without notice.
| [reply] |
$2=ab
$2=a
$2=b
| [reply] |
Please expound on why you believe this should be the output. To my understanding, the regular expression specification defines output but not method. If there is a specification or archived developer discussion I am unaware of, I would appreciate the citation. Otherwise, I do not see a compelling argument for "it should be the last thing matched by those 'physical' parentheses" over my proposal.
| [reply] |