http://qs1969.pair.com?node_id=160771

You may remember my recent node about extending regular expressions to allow recursion. I've completely rewritten the implementation, so that it backtracks properly. All the known bugs have been fixed, and both the regexes that Juerd posted last time now seem to work.

The web page has been updated to reflect the changes, and the new patch is available from there. The most significant functional change is that calls must be recursive. With the previous version you could write

/(a*)b(?1)/
and it would mean the same as /(a*)ba*/. Now you can only call out to a group that's still open. I made that change for technical reasons related to the implementation (so that I could use stack rather than heap storage, basically), but I don't think it's a big problem. Let me know if you disagree.

Proper detection of left recursion is implemented as well, meaning that a regex like /((?1))/ won't even compile now.

On a slight tangent, is anyone else interested in the idea of (as an experiment) hacking perl to use the PCRE library in place of its built-in engine? I don't think it would be all that hard.

Replies are listed 'Best First'.
Re: Recursive Regex: Update
by Juerd (Abbot) on Apr 20, 2002 at 13:03 UTC

    All the known bugs have been fixed, and both the regexes that Juerd posted last time now seem to work.

    They indeed do. Actually, there is only one, the second one doesn't use a conditional but is the same regex. Here is the regex again:

    % ^ \s* ( # <1> < \s* ([a-zA-Z:]+) # <2/> (?: \s*[a-zA-Z:]* \s* = \s* (?:'[^']*'|"[^"]*") )* \s* (/\s*)? # <3/> > (?:[^<>]* | (?1))* (?(3)| <\s*/\s*\2\s*> ) ) # </1> \s* $ %x
    And this time, everything matches (or matches not) correctly:
    <foo><bar></bar></foo> # Match <foo><bar></foo></bar> # No match <foo><bar/></foo> # Match * <foo><bar></foo> # No match <foo bar=baz/> # No match <foo bar="baz"> # No match <foo bar="baz"/> # Match < fooo / > # Match <foo/>foo # No match foo<foo/> # No match <foo>foo</foo> # Match <foo><bar/>foo</foo> # Match * <a><b><c></c></b></a> # Match * This one matches as well * (wow) <wml> <card id="mail" title="C: WAP POP3"> <onevent type="onenterforward"> <refresh> <setvar name="host" value="mail.some-domain"/> </refresh> </onevent> <p> Host: <input type="text" format="*m" name="host" value="$(host)" +/> <br/> User: <input type="text" format="*m" name="user"/> <br/> Pass: <input type="password" format="*m" name="pass"/> <br/> <a href="pop3.cgi?host=$(host:e)&amp;user=$(user:e)&amp;pa +ss=$(pass:e)"> Login </a> </p> </card> </wml> This single regex can actually test XML well-formedness :) (Doesn't take <?xml?> and <!DOCTYPE> etc, though)
    Those with an * didn't work with the previous patch, but do now.

    On a slight tangent, is anyone else interested in the idea of (as an experiment) hacking perl to use the PCRE library in place of its built-in engine? I don't think it would be all that hard.

    If it were released as a module, it would be very nice. Maybe it's even possible to create pcre_s/// and pcre_m//? If not, a normal functional interface or perhaps even object oriented interface to the PCRE library would be great. I wouldn't like it replacing Perl's built-in engine.

    - Yes, I reinvent wheels.
    - Spam: Visit eurotraQ.
    

      If it were released as a module, it would be very nice.
      Well, writing a PCRE module with a procedural or object-oriented interface would be reasonably easy, if that's what you mean. Getting slightly more ambitious, if perl's regex engine were made properly pluggable, you could write a pragma to switch in the PCRE implementation.
      use PCRE; print "Well formed!\n" if /$Juerds_regex/;
      Out of interest, why wouldn't you like it replacing Perl's built-in engine? (supposing the UTF8 support were properly finished)

        Out of interest, why wouldn't you like it replacing Perl's built-in engine?

        As far as I know (although I haven't benchmarked), Perl's regex engine is faster. If it's not, PCRE may as well replace Perl's own regex engine, when UTF8 is fixed :) Then we'd have Perl with a Regex Engine that is Perl Compatible, which sounds reasonable...

        - Yes, I reinvent wheels.
        - Spam: Visit eurotraQ.
        

        if perl's regex engine were made properly pluggable

        Thats a great idea. Ive wanted to be able to use a DFA based engine under certain situations...

        Yves / DeMerphq
        ---
        Writing a good benchmark isnt as easy as it might look.

Re: Recursive Regex: Update
by BrentDax (Hermit) on Apr 22, 2002 at 18:50 UTC
    On a slight tangent, is anyone else interested in the idea of (as an experiment) hacking perl to use the PCRE library in place of its built-in engine? I don't think it would be all that hard.

    It took me a week to hack three new operators into toke.c and perly.y and add the corresponding opcodes, and Perl's tokenizer is second only to its regex engine in how difficult it is to modify. In other words, have fun--the sane people will watch from a safe distance. :^)

    =cut
    --Brent Dax
    There is no sig.

      Oh, I'm not planning to touch the regex engine. That would be lunacy, as you say! Just bypassing the current engine altogether will be much less difficult.

      I'm appropriately respectful of your bravery in venturing into the tokeniser. It scares the hell out of me.

        My things have come a long way since this thread. :-)

        Id say hacking this into the regex engine is probably a good thing. And i suspect that its probably easier to do so than to hack PCRE into perl. Even perls own regex engine is not easily properly pluggable due to a lack of hooks into sv.c's re_dup(). I suspect that getting the PCRE engine would have the same problems. Which actually reminds me, we need to make sure that Perl 5.10 has the appropriate hooks so that making pluggable regex engines for perl 5.10 and later is easier.

        ---
        $world=~s/war/peace/g