The problem is that it's still tricky to get at the parse tree. All the parts are there -- named captures, relative backreferences, recursion -- but you still have to litter your regexp with hairy use of (?{ local $FOO{blah} = $^N })
to get the data back.
It should be possible to whip something up with re::engine and Regexp::Parser, but a brief attempt showed me that this is somewhat tricky to get right except in simple cases. It would be much more convenient to have this support built into the regex engine. Perhaps for 5.12... | [reply] [d/l] |
abigail mentioned something similar, but i havent gotten my head around what is really needed. If i had a better idea id be able to given a better estimate of when it could be done by. For instance there is a considerable amount of data on the stack after a match that might be useful for this purpose, i just dont really have a good grip on the problem to make any useful suggestions.
---
$world=~s/war/peace/g
| [reply] |
The inspiration comes from Parse::RecDescent's "autotree" directive, which basically constructs a parse tree by collecting submatches into a hash (keyed by submatch name) and blessing that into a package named according to the parent rule, e.g. A ::= B C creates an object of type A with two fields B and C.
I think a much less blessed scheme would be appropriate for the regex engine, in which numbered submatches are collected into match variables @/ and %/. For @/, each capturing group generates a capture:
- If the capturing group is quantified, it is an arrayref of captures, one for each repetition.
- Otherwise, if the capturing group contains no submatches, the capture is a string containing the captured text.
- Otherwise, $/[0] contains the captured text, and @/[1..N] contain the captures for nested groups $1 ... $N, where the indices are numbered starting from the first nested submatch.
Similarly, for %/, each capturing group yields a hash capture:
- $/{0} is the entire text.
- If (?'NAME'...) has no subcaptures, $/{NAME} is the captured text.
- Otherwise if (?'NAME'...) is quantified, $/{NAME} is an arrayref of hash captures, one entry for each repetition.
- Otherwise, $/{NAME} is a hash capture.
- (maybe) $/{1} through $/{N} are like $/{NAME}, but for all captures, not just named ones.
For example, I believe the above rules should yield:
$sexp = qr/
(?<sexp>
\s* \( \s* (?&sexp)* \s* \) \s*
| \s* (?<atom>\w+) \s*
)/x;
'(A (B C))' =~ /$sexp/;
## AFTERWARDS
%/ = (sexp => {
0 => '(A (B C))',
sexp => [{ 0 => 'A', atom => 'A' },
{ 0 => '(B C)',
sexp => [{ 0 => 'B', atom => 'B' },
{ 0 => 'C', atom => 'C' }] }]};
@/ = ('(A (B C))',
[['A', 'A'],
['(B C)', [['B', 'B'],
['C', 'C']]]]);
There are probably some obvious oversights here, but I'll try to get the Regexp::Parser version of this working to shake the bugs out.
| [reply] [d/l] [select] |