http://qs1969.pair.com?node_id=572226

Recently I was investigating a bug in blead-perl (the latest development release of perl) where some of the Regexp::Common tests were failing. One of the failing tests was for matching an arbitrarily nested balanced pattern, something that you can't actually do with a regular expression. Perl only allows it to be done because of the (??{}) syntax, which executes a piece of code and then uses its return as a pattern to be matched as though it were part of the original pattern. If the returned pattern itself contains a (??{}) the matching can become recursive such as when the returned obeject is the original. An example is this:

our $qr=qr/<(?:[<>\\]+|\\.|(??{$qr}))*>/; print "not " unless '<<><><<<>>><>>'=~/^($qr)$/; print "ok - $1\n";

Now, the idea I had was to be able to write the above as this: (With suitable handwaving about the exact notation)

print "not " unless '<<><><<<>>><>>'=~/^((?&:<(?:[<>\\]+|\\.|(?:&))*>) +)$/; print "ok - $1\n";

The idea is that that the (?&: ... ) marks a subsection of the pattern that can be recursed to. (?&) would mean recurse to the (?&:...) part the pattern. This way the statement is selfcontained, and requires no perl evaluation to occur, and requires only one compiled regexp per pattern, instead of many as the current scheme dictates (embedding a qr// in a larger pattern results in a complete recompile).

An extension of this would be to allow such subsections to named, maybe (?&name:...) and (?&name), which would I think allow some very Perl6 rule like behaviour. The addition of a matches nothing block, say (?&& ... ) block would make it possible to define a bunch of rules and then reuse them in other patterns.

my $rules=qr/(?&& # compile this stuff, but dont match it (&foo: .... ) # define ... (&bar: .... ) # ... some rules ) /x; if ($blah=~/(&foo)(&bar)$rules/) { ... }

As far as I understand it adding this kind of thing to perl5's regex engine wouldn't be particularly difficult. It would only require the addition of a regop or two, and some additional code in the optimiser. Most of the infrastructure to handle (??{}) can be reused, so the main thing is the dealing with nesting/forward declarations and things like that, stuff I dont think would be too hard to handle. Note that all of this assumes the current behaviour of (??{}) WRT capturing parens: the ones that matter are in the top level pattern only. (Although maybe that assumption can be relaxed... I dont know...)

Anyway, I was just curious what people thought of this.

---
$world=~s/war/peace/g

Replies are listed 'Best First'.
Re: Perl6ish rules in Perl5's regex engine
by ysth (Canon) on Sep 10, 2006 at 16:07 UTC
    and requires only one compiled regexp per pattern, instead of many as the current scheme dictates (embedding a qr// in a larger pattern results in a complete recompile).
    It was my understanding that this wasn't true - that no recompile took place when nesting with (??{}), and testing seems to bear this out:
    $ perl -we'use re "debug"; $x=qr/(.)\1|(.)(??{$x})\2/; "abcddcba" =~ / +^$x\z/' Compiling REx `(.)\1|(.)(??{$x})\2' size 20 Got 164 bytes for offset annotations. 1: BRANCH(9) 2: OPEN1(4) 4: REG_ANY(5) 5: CLOSE1(7) 7: REF1(20) 9: BRANCH(20) 10: OPEN2(12) 12: REG_ANY(13) 13: CLOSE2(15) 15: LOGICAL[2](16) 16: EVAL(18) 18: REF2(20) 20: END(0) minlen 1 with eval Offsets: [20] 0[0] 1[1] 0[0] 2[1] 3[1] 0[0] 4[2] 0[0] 6[1] 7[1] 0[0] 8[1] 9[1] 0 +[0] 17[0] 17[0] 0[0] 18[2] 0[0] 20[0] Compiling REx `^(?-xism:(.)\1|(.)(??{$x})\2)\z' size 23 Got 188 bytes for offset annotations. first at 2 1: BOL(2) 2: BRANCH(10) 3: OPEN1(5) 5: REG_ANY(6) 6: CLOSE1(8) 8: REF1(22) 10: BRANCH(21) 11: OPEN2(13) 13: REG_ANY(14) 14: CLOSE2(16) 16: LOGICAL[2](17) 17: EVAL(19) 19: REF2(22) 21: TAIL(22) 22: EOS(23) 23: END(0) floating `'$ at 1..2147483647 (checking floating) anchored(BOL) minlen + 1 with eval Offsets: [23] 1[1] 9[1] 10[1] 0[0] 11[1] 12[1] 0[0] 13[2] 0[0] 15[1] 16[1] 0[0] +17[1] 18[1] 0[0] 26[0] 26[0] 0[0] 27[2] 0[0] 28[0] 30[2] 32[0] Guessing start of match, REx `^(?-xism:(.)\1|(.)(??{$x})\2)\z' against + `abcddcba'... Found floating substr `'$ at offset 8... Guessed: match at offset 0 Matching REx `^(?-xism:(.)\1|(.)(??{$x})\2)\z' against `abcddcba' Setting an EVAL scope, savestack=15 0 <> <abcddcba> | 1: BOL 0 <> <abcddcba> | 2: BRANCH Setting an EVAL scope, savestack=21 0 <> <abcddcba> | 3: OPEN1 0 <> <abcddcba> | 5: REG_ANY 1 <a> <bcddcba> | 6: CLOSE1 1 <a> <bcddcba> | 8: REF1 failed... 0 <> <abcddcba> | 11: OPEN2 0 <> <abcddcba> | 13: REG_ANY 1 <a> <bcddcba> | 14: CLOSE2 1 <a> <bcddcba> | 16: LOGICAL[2] 1 <a> <bcddcba> | 17: EVAL re_eval 0x1003bb30 Entering embedded `(.)\1|(.)(??{$x})\2' Setting an EVAL scope, savestack=35 1 <a> <bcddcba> | 1: BRANCH Setting an EVAL scope, savestack=41 1 <a> <bcddcba> | 2: OPEN1 1 <a> <bcddcba> | 4: REG_ANY 2 <ab> <cddcba> | 5: CLOSE1 2 <ab> <cddcba> | 7: REF1 failed... 1 <a> <bcddcba> | 10: OPEN2 1 <a> <bcddcba> | 12: REG_ANY 2 <ab> <cddcba> | 13: CLOSE2 2 <ab> <cddcba> | 15: LOGICAL[2] 2 <ab> <cddcba> | 16: EVAL re_eval 0x10019550 Entering embedded `(.)\1|(.)(??{$x})\2' Setting an EVAL scope, savestack=55 2 <ab> <cddcba> | 1: BRANCH Setting an EVAL scope, savestack=61 2 <ab> <cddcba> | 2: OPEN1 2 <ab> <cddcba> | 4: REG_ANY 3 <abc> <ddcba> | 5: CLOSE1 3 <abc> <ddcba> | 7: REF1 failed... 2 <ab> <cddcba> | 10: OPEN2 2 <ab> <cddcba> | 12: REG_ANY 3 <abc> <ddcba> | 13: CLOSE2 3 <abc> <ddcba> | 15: LOGICAL[2] 3 <abc> <ddcba> | 16: EVAL re_eval 0x10019550 Entering embedded `(.)\1|(.)(??{$x})\2' Setting an EVAL scope, savestack=75 3 <abc> <ddcba> | 1: BRANCH Setting an EVAL scope, savestack=81 3 <abc> <ddcba> | 2: OPEN1 3 <abc> <ddcba> | 4: REG_ANY 4 <abcd> <dcba> | 5: CLOSE1 4 <abcd> <dcba> | 7: REF1 5 <abcdd> <cba> | 20: END Setting an EVAL scope, savestack=95 restoring \1 to 2(2)..-1 restoring \2 to 2(2)..3 5 <abcdd> <cba> | 18: REF2 6 <abcddc> <ba> | 20: END Setting an EVAL scope, savestack=109 restoring \1 to 1(1)..-1 restoring \2 to 1(1)..2 6 <abcddc> <ba> | 18: REF2 7 <abcddcb> <a> | 20: END Setting an EVAL scope, savestack=123 restoring \1 to 0(0)..-1 restoring \2 to 0(0)..1 7 <abcddcb> <a> | 19: REF2 8 <abcddcba> <> | 22: EOS 8 <abcddcba> <> | 23: END Match successful! Freeing REx: `"^(?-xism:(.)\\1|(.)(??{$x})\\2)\\z"' Freeing REx: `"(.)\\1|(.)(??{$x})\\2"'

      In this case there are two patterns (as your output shows). The first is the precompiled $qr and the second is that formed by concatenating the $qr into a larger pattern. You are correct that when processing the (??{$qr}) there is no recompile tho. When I said "many" I was thinking of scenarios where you would have multiple possible sub patterns (or rules), or where the (??{...}) returned a string instead of a pattern.

      BTW, note it does eval the code everytime.... Even if it doesnt recompile the pattern.

      ---
      $world=~s/war/peace/g

Re: Perl6ish rules in Perl5's regex engine
by audreyt (Hermit) on Sep 11, 2006 at 01:02 UTC
    A regex fixpoint! That'll be insanely great for my nefarious purposes, such as Template::Extract, Template::Generate and friends. I've been pondering compiling Perl 6 rules to PCRE, taking advantage of its callback mechanism to implement nested capturing, but a fixpoint will elegantly provide this capability as well. :-)
      Does that mean you volunteer to write the tests? :)
      I had the same idea a few years ago, and implemented it in PCRE. The author of PCRE reworked the implementation but kept the idea, and it's been a standard feature of PCRE since then. I still think it's very cool, but I don't know whether anyone actually uses it...

      I didn't have the guts to try and do it for perl's regex engine. If you do it, it might be worth considering using the same syntax as PCRE?

        If you do it, it might be worth considering using the same syntax as PCRE?

        Well, if it meant the patch was more likely to be applied then yes. OTOH, Im not super keen on using capturing buffers as the fix point. But im not totally against it either. I really need to look into PCRE more closely. How do they do named capture buffers?

        ---
        $world=~s/war/peace/g