in reply to Re^2: Perl regexp matching is slow??
in thread Perl regexp matching is slow??

As for the back-refs and recursive parse, that's going to be more in demand, not less. For example .NET implementation of Perl regex was designed to be able to parse XML without a perf hit and while I couldn't say if Perl would take a perf hit for something like that, it definitely can parse XML with regex much easier then .NET.

Im not sure what performance hit you mean. Although ambrus had some neat ideas for something like the super-linear cache with regards to pattern level recursion that i have yet to follow up on.

Also juerd has published a validation regex for xml that is derived from the EBNF that w3 has posted. The story that you cant parse XML or HTML with regexes is going to go up in a puff of smoke when Perl 5.10 comes out.

---
$world=~s/war/peace/g

Replies are listed 'Best First'.
Re^4: Perl regexp matching is slow??
by educated_foo (Vicar) on Feb 27, 2007 at 18:20 UTC
    The problem is that it's still tricky to get at the parse tree. All the parts are there -- named captures, relative backreferences, recursion -- but you still have to litter your regexp with hairy use of
    (?{ local $FOO{blah} = $^N })
    to get the data back.

    It should be possible to whip something up with re::engine and Regexp::Parser, but a brief attempt showed me that this is somewhat tricky to get right except in simple cases. It would be much more convenient to have this support built into the regex engine. Perhaps for 5.12...

      abigail mentioned something similar, but i havent gotten my head around what is really needed. If i had a better idea id be able to given a better estimate of when it could be done by. For instance there is a considerable amount of data on the stack after a match that might be useful for this purpose, i just dont really have a good grip on the problem to make any useful suggestions.

      ---
      $world=~s/war/peace/g

        The inspiration comes from Parse::RecDescent's "autotree" directive, which basically constructs a parse tree by collecting submatches into a hash (keyed by submatch name) and blessing that into a package named according to the parent rule, e.g. A ::= B C creates an object of type A with two fields B and C.

        I think a much less blessed scheme would be appropriate for the regex engine, in which numbered submatches are collected into match variables @/ and %/. For @/, each capturing group generates a capture:

        • If the capturing group is quantified, it is an arrayref of captures, one for each repetition.
        • Otherwise, if the capturing group contains no submatches, the capture is a string containing the captured text.
        • Otherwise, $/[0] contains the captured text, and @/[1..N] contain the captures for nested groups $1 ... $N, where the indices are numbered starting from the first nested submatch.
        Similarly, for %/, each capturing group yields a hash capture:
        • $/{0} is the entire text.
        • If (?'NAME'...) has no subcaptures, $/{NAME} is the captured text.
        • Otherwise if (?'NAME'...) is quantified, $/{NAME} is an arrayref of hash captures, one entry for each repetition.
        • Otherwise, $/{NAME} is a hash capture.
        • (maybe) $/{1} through $/{N} are like $/{NAME}, but for all captures, not just named ones.
        For example, I believe the above rules should yield:
        $sexp = qr/ (?<sexp> \s* \( \s* (?&sexp)* \s* \) \s* | \s* (?<atom>\w+) \s* )/x; '(A (B C))' =~ /$sexp/; ## AFTERWARDS %/ = (sexp => { 0 => '(A (B C))', sexp => [{ 0 => 'A', atom => 'A' }, { 0 => '(B C)', sexp => [{ 0 => 'B', atom => 'B' }, { 0 => 'C', atom => 'C' }] }]}; @/ = ('(A (B C))', [['A', 'A'], ['(B C)', [['B', 'B'], ['C', 'C']]]]);

        There are probably some obvious oversights here, but I'll try to get the Regexp::Parser version of this working to shake the bugs out.