Recently I was investigating a bug in blead-perl (the latest development release of perl) where some of the Regexp::Common tests were failing. One of the failing tests was for matching an arbitrarily nested balanced pattern, something that you can't actually do with a regular expression. Perl only allows it to be done because of the (??{}) syntax, which executes a piece of code and then uses its return as a pattern to be matched as though it were part of the original pattern. If the returned pattern itself contains a (??{}) the matching can become recursive such as when the returned obeject is the original. An example is this:

our $qr=qr/<(?:[<>\\]+|\\.|(??{$qr}))*>/; print "not " unless '<<><><<<>>><>>'=~/^($qr)$/; print "ok - $1\n";

Now, the idea I had was to be able to write the above as this: (With suitable handwaving about the exact notation)

print "not " unless '<<><><<<>>><>>'=~/^((?&:<(?:[<>\\]+|\\.|(?:&))*>) +)$/; print "ok - $1\n";

The idea is that that the (?&: ... ) marks a subsection of the pattern that can be recursed to. (?&) would mean recurse to the (?&:...) part the pattern. This way the statement is selfcontained, and requires no perl evaluation to occur, and requires only one compiled regexp per pattern, instead of many as the current scheme dictates (embedding a qr// in a larger pattern results in a complete recompile).

An extension of this would be to allow such subsections to named, maybe (?&name:...) and (?&name), which would I think allow some very Perl6 rule like behaviour. The addition of a matches nothing block, say (?&& ... ) block would make it possible to define a bunch of rules and then reuse them in other patterns.

my $rules=qr/(?&& # compile this stuff, but dont match it (&foo: .... ) # define ... (&bar: .... ) # ... some rules ) /x; if ($blah=~/(&foo)(&bar)$rules/) { ... }

As far as I understand it adding this kind of thing to perl5's regex engine wouldn't be particularly difficult. It would only require the addition of a regop or two, and some additional code in the optimiser. Most of the infrastructure to handle (??{}) can be reused, so the main thing is the dealing with nesting/forward declarations and things like that, stuff I dont think would be too hard to handle. Note that all of this assumes the current behaviour of (??{}) WRT capturing parens: the ones that matter are in the top level pattern only. (Although maybe that assumption can be relaxed... I dont know...)

Anyway, I was just curious what people thought of this.

---
$world=~s/war/peace/g


In reply to Perl6ish rules in Perl5's regex engine by demerphq

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.