Larry Wall has dictated that in Perl 6, all regular expressions will be represented as bytecode in the same format as the rest of the program. For example, there might be an anything op which was equivalent to '.'. Naturally, a bunch of people (including myself) have started trying to hack RE support into Parrot. My current approach is something like this:
match "afoobarz", "i" RE_0: goforward END literal "f" RE_1: lazyrepeat literal "o" saveindex literal "bar" backtrack RE_1 backtrack RE_0 reend I0
That's equivalent to "afoobarz"=~/fo*?bar/i. While obviously it's better than nothing, I still feel like it's suboptimal.

Can some regexp internals experts give me an idea of how something like this would best be set up? I'm trying to stay within the realm of normal Parrot opcodes with some extra data structures in the C code to hold things like the current index; that means that you should think in terms of labels, not trees. I don't really care about speed at this point, just about something that works.

Thanks in advance!

=cut
--Brent Dax
There is no sig.

Replies are listed 'Best First'.
Re: Flattening REs into opcodes for Perl 6
by japhy (Canon) on Nov 02, 2001 at 20:41 UTC
    You feel a pair of eyes staring at you, from behind the trees.

    Well, I guess I'll pipe up. First, I would like to see a parallel between the actual node names and the pseudocode you have above. Here's how /fo*?bar/i is currently represented (via -Dr or -mre=debug):

    1: EXACTF <f>(3) 3: MINMOD(4) 4: STAR(7) 5: EXACTF <o>(0) 7: EXACTF <bar>(10) 10: END(0)
    I'd like to see something like that. The numbers in parentheses represent where to go after the node has matched -- a value of 0 is obviously a special one.
    start: exactf "f", find_o advance goto start find_o: exactf "o" exactf "bar", done goto find_o done: end
    That looks about right to me. Here we have exactf taking one or two arguments -- the string to match, and the node to jump to if successful (optional). If you want to see how /fo*bar/ differs...
    start: exactf "f", find_o advance goto start find_o: save exactf "o", find_o find_bar: exactf "bar", done restore goto find_bar done: end
    Here, we've added find_o as the second argument to exactf "o", making it execute itself as many times as it can (greedy). We've also added two important instructions (which aren't all that important, given the text of the regex), save and restore. These basically make a push-pop method for recording and going back to a location in the string.

    Notice this code shows absolutely NOTHING in the way of optimizations. Here's the optimized code for /fo*?bar/:

    start: exactf "f", find_o advance goto start find_o: exactf "o", find_o exactf "bar", done goto start done: end
    And here's the (surprisingly similar!) optimized code for /fo*bar/:
    start: exactf "f", find_o advance goto start find_o: exactf "o", find_o exactf "bar", done goto start done: end
    They're the same because the regexes are one in the same in this specific example. More complex regexes will behave differently. This regex just happens to be wonderfully implemented like a DFA, really...

    Well, I hope this has been a good example.

    _____________________________________________________
    Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
    s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

      First of all, thanks for replying.

      start: exactf "f", find_o advance goto start find_o: exactf "o" exactf "bar", done goto find_o done: end
      How does that deal with being matched against, say, "afafoobarz"? Perhaps:
      start: save exactf "f", find_o restore #to keep the balance advance goto start find_o: exactf "o", find_bar restore advance goto start find_bar: exactf "bar", done goto find_o ...
      would be more correct? Actually, come to think of it, I don't see how it deals with a case like "babylon" (it doesn't match at all) either--maybe advance should have a parameter that it jumps to if it fails?

      =cut
      --Brent Dax
      There is no sig.

        Ooh, right, thanks for pointing that out. Let's see...
        start: exactf "f", find_o advance quit goto start find_o: # non-greedy, by the way goto find_bar find_o_real: exactf "o", find_o goto start find_bar: exactf "bar", done goto find_o_real done: end quit: fail
        I should point out that I haven't optimized find_o out for readability purposes, and that in "real life", this would look like:
        start: exactf "f", find_bar advance quit goto start find_o_real: exactf "o", find_bar goto start find_bar: exactf "bar", done goto find_o_real done: end quit: fail
        This feels like drawing a DFA... honestly. I'm not sure where save and restore are going to come into play in this particular example -- it seems not at all.

        _____________________________________________________
        Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
        s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;