This is actually not a bug. It is a slightly counter-intuitive result of how @+/@-, (?(DEFINE) ...) and named captures/named subroutines interact. It probably could have been implemented slightly differently without any harm, but as of now the behaviour probably cannot be changed.
First, I modified a version of your code from Re^4: Strange behavior of @- and @+ in perl5.10 regexps:
Which outputs:
before:
@- = (0) 1 items
@+ = (0, , , , , ) 6 items
expr:
@- = (0, , , , 0, 1) 6 items
@+ = (2, , , , 1, 2) 6 items
expr:
@- = (0, 0, , , 2, 3) 6 items
@+ = (4, 2, , , 3, 4) 6 items
after:
@- = (0, 0, 2) 3 items
@+ = (4, 2, 4, , , ) 6 items
matches: (abcd)
At the very end:
@- = (0, 0, 2) 3 items
@+ = (4, 2, 4, , , ) 6 items
So, first, if you look at Perl 5.10.x perlvar under @- and @+ you will see the following documentation. I have bolded the relevant bits.
- @LAST_MATCH_END
- @+
This array holds the offsets of the ends of the last successful submatches in the currently active dynamic scope. $+[0] is the offset into the string of the end of the entire match. This is the same value as what the pos function returns when called on the variable that was matched against. The nth element of this array holds the offset of the nth submatch, so $+[1] is the offset past where $1 ends, $+[2] the offset past where $2 ends, and so on.
You can use $#+ to determine how many subgroups were in the last successful match.
See the examples given for the "@-" variable.
- @LAST_MATCH_START
- @-
$-[0] is the offset of the start of the last successful match. $-[$n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with substr $_, $-[n], $+[n] - $-[n] if $-[n] is defined, and $+ coincides with substr $_, $-[$#-], $+[$#-] - $-[$#-]. One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with @+.
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The nth element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.
After a match against some variable $var:
$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])
$1 is the same as substr($var, $-[1], $+[1] - $-[1])
$2 is the same as substr($var, $-[2], $+[2] - $-[2])
$3 is the same as substr($var, $-[3], $+[3] - $-[3])
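The equivalences quoted above can be checked directly. This is only a minimal sketch; the string and the pattern are made up for illustration and are not from the original code.

```perl
use strict;
use warnings;

# Illustrative input, not taken from the thread.
my $var = "key=value;";
$var =~ /(\w+)=(\w+)/ or die "no match";

# Rebuild each piece from the offsets stored in @- and @+.
my $whole  = substr($var, $-[0], $+[0] - $-[0]);   # what $& would hold
my $first  = substr($var, $-[1], $+[1] - $-[1]);   # what $1 holds
my $second = substr($var, $-[2], $+[2] - $-[2]);   # what $2 holds
my $after  = substr($var, $+[0]);                  # what $' would hold

print "whole='$whole' first='$first' second='$second' after='$after'\n";
```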
Now you may wonder, "why six elements?" It is not obvious at first: it appears that only four capture buffers are used in the pattern, so there should be five slots (the zeroth element tracks $&). However, there are actually five capture buffers in this pattern, because one is reserved for the (?<expr> ... ). It never gets set, because it sits inside the (?(DEFINE) (?<expr>...)) and is only ever executed as (?&expr), which never executes the /capture/ part of the (?<expr> ... ). So the 4th slot of the array (index 3 in the output above) never gets populated.
This was actually a deliberate design decision: consider that it would be awkward if /(?<foo>foo)((?&foo))/ resulted in $1 and $2 pointing at the same string. However, maybe what happens to a capture buffer defined in a DEFINE block should have been reviewed once (?(DEFINE) ...) was introduced. The development of these features was somewhat organic, with a lot of it actually just being "tricks". For instance, (?(DEFINE) ... ) isn't really special: at heart it is just an optimized alias of (?(0) ... ) (with some error checking to disallow an ELSE block), and subroutines just piggyback on named capture, so... Well, as is sometimes said of Perl core-dev, it's all a bit of a game of Jenga. :-)
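To make the /(?<foo>foo)((?&foo))/ case concrete, here is a small sketch. The input string "foofoo" is my own choice; the point is only that $1 keeps the text captured by the group itself, while $2 wraps the text matched by the subroutine call.

```perl
use strict;
use warnings;

# "foofoo" is an illustrative input, not from the thread.
"foofoo" =~ /(?<foo>foo)((?&foo))/ or die "no match";

# $1 is the capture made by (?<foo>...) itself; the (?&foo) call
# re-runs the sub-pattern but does not overwrite that capture.
print "\$1 spans $-[1]..$+[1], \$2 spans $-[2]..$+[2]\n";
```

If the subroutine call did overwrite the capture, both groups would report the same span.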
While it is arguable that no slot should be reserved for a named capture buffer defined in a (?(DEFINE) ... ) block, the fact that @- and @+ are not the same size is a deliberate choice, and the behaviour you are seeing is expected, although admittedly in this context the results are a bit odd-looking.
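A stripped-down way to see the reserved slot, assuming a made-up pattern (the group name "unused" is hypothetical): two ordinary groups, plus one group that exists only inside a DEFINE block and is never called.

```perl
use strict;
use warnings;

# Hypothetical pattern: (?<unused>...) lives only inside DEFINE and is
# never invoked, yet it still counts as a group of the pattern.
"ab" =~ /(a)(b)(?(DEFINE)(?<unused>z))/ or die "no match";

my $groups_in_pattern  = $#+;   # number of groups the pattern declares
my $last_matched_group = $#-;   # index of the last group that matched

print "groups in pattern:  $groups_in_pattern\n";
print "last matched group: $last_matched_group\n";
print "scalar(\@+) = ", scalar(@+), ", scalar(\@-) = ", scalar(@-), "\n";
```

So @+ is sized by the pattern, and @- by what actually matched, which is why the two can differ.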
HTH
Note: I rejected the bug report you filed on this, thanks anyway. It did raise an interesting question that I will think on.
---
$world=~s/war/peace/g
Oh, just to pre-empt the question, "so why does @- have 6 elements inside of (?&expr) but not outside it", which I suspect is likely to come up.
The answer is that effectively the values of $4 and $5 are localized to the scope of the (?&expr) "subroutine", so once the subpattern matches and does its "return" back to the previous context they are reverted to their previous undefined value. Again this is for good reasons: try using a subroutine in a pattern that ISN'T defined in a (?(DEFINE) ... ) and play around with it. In that case you very much don't want the use of a named capture as a subroutine to pollute its use as a named capture.
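Here is one way to play around with that. It is a sketch with a made-up pattern (the group name "pair" and the input are hypothetical): a named group with two inner captures is matched once directly and then called again as a subroutine.

```perl
use strict;
use warnings;

# Hypothetical example: (?<pair>...) matches "ab" directly, then
# (?&pair) re-runs it as a subroutine over "cd".
"abcd" =~ /(?<pair>(\w)(\w))(?&pair)/ or die "no match";

# The captures made inside the subroutine call are reverted on return,
# so the groups still describe the direct match, not the call.
print "pair='$+{pair}' \$2='$2' \$3='$3' end of match=$+[0]\n";
```

The whole match runs to offset 4, but the inner captures still describe the first two characters.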
---
$world=~s/war/peace/g
Hmm, probably not documented directly and might not be tested explicitly. But certainly indirectly. We have lots of tests that $1 and friends behave "as expected" inside of (?{ ... }) and (??{ ... }) blocks. So effectively that means that @- and @+ have to as well, as they are all just ties into the same C level data structures.
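A tiny illustration of that point, again only a sketch: the variable names are mine, and I use package variables because closing over lexicals inside (?{ ... }) was unreliable in older perls.

```perl
use strict;
use warnings;

# Hypothetical names; package variables are used on purpose (see above).
our ($seen, $start, $end);

# Inside the code block, $1 and the matching @-/@+ entries are already
# visible for the group that has just closed.
"abc" =~ /(a)(?{ ($seen, $start, $end) = ($1, $-[1], $+[1]) })bc/
    or die "no match";

print "inside the block: \$1='$seen' \$-[1]=$start \$+[1]=$end\n";
```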
Now, at a certain level these constructs are still documented as experimental or subject to change so technically you have a point, and I appreciate that you pointed this out.
But I personally would/do see problems with the magic variables inside of these constructs as bugs; the experimental status just says I get to change my mind if I want. :-) However, in this case things are working pretty much exactly as planned, with the possible nit as to whether (?<expr> ... ) should have a slot allocated to it that never gets used. That is mostly irritating because it is wasteful and a little counter-intuitive, but it is actually expected behaviour.
---
$world=~s/war/peace/g
The documentation of both 5.8 and 5.10 points this out:
@LAST_MATCH_END
@+
...
You can use $#+ to determine how many
subgroups were in the last successful match.
...
@LAST_MATCH_START
@-
...
One can use "$#-" to find the last matched
subgroup in the last successful match.
Contrast with $#+, the number of subgroups
in the regular expression.
Running the following program with perl v5.8.8 or v5.10.1
pl@nereida:~/Lperltesting$ cat abigail1.pl
#!/usr/bin/perl
local $" = ", ";
"a" =~ /(a)|(b)/;
print "\@- = (@-)\t length of \@- = ".((scalar @-)."\t last - index = $#-\n");
print "\@+ = (@+)\t length of \@+ = ".((scalar @+)."\t last + index = $#+\n")
produces the same output:
pl@nereida:~/Lperltesting$ perl5.10.1 ./abigail1.pl
@- = (0, 0) length of @- = 2 last - index = 1
@+ = (1, 1, ) length of @+ = 3 last + index = 2
pl@nereida:~/Lperltesting$ perl5.8.8 ./abigail1.pl
@- = (0, 0) length of @- = 2 last - index = 1
@+ = (1, 1, ) length of @+ = 3 last + index = 2
and so the behavior of the perl 5.10 regexp engine is perfectly right. The last matched subgroup is subgroup 1, but the regular expression contains 2 subgroups.
Many thanks for your kind answers,
It has been most helpful,
I am absolutely convinced it is not a bug:
I have just read the email from Abigail in which he points
out that the same behavior appears in previous versions
of Perl:
pl@nereida:~$ perl -v
This is perl, v5.8.8 built for x86_64-linux-gnu-thread-multi
Copyright 1987-2006, Larry Wall
Perl v5.8.8 regexp engine also produces @+ and @- of different sizes:
pl@nereida:~$ perl -wde 0
Loading DB routines from perl5db.pl version 1.28
Editor support available.
Enter h or `h h' for help, or `man perldebug' for more help.
main::(-e:1): 0
DB<1> "a" =~ /(a)|(b)/; print ((scalar @-)."\n"); print ((scalar @+)."\n")
2
3
Many thanks for your help