in reply to Flattening REs into opcodes for Perl 6

You feel a pair of eyes staring at you, from behind the trees.

Well, I guess I'll pipe up. First, I would like to see a parallel between the actual node names and the pseudocode you have above. Here's how /fo*?bar/i is currently represented (via -Dr or -mre=debug):

1: EXACTF <f>(3) 3: MINMOD(4) 4: STAR(7) 5: EXACTF <o>(0) 7: EXACTF <bar>(10) 10: END(0)
I'd like to see something like that. The numbers in parentheses represent where to go after the node has matched -- a value of 0 is obviously a special one.
start: exactf "f", find_o advance goto start find_o: exactf "o" exactf "bar", done goto find_o done: end
That looks about right to me. Here we have exactf taking one or two arguments -- the string to match, and the node to jump to if successful (optional). If you want to see how /fo*bar/ differs...
start: exactf "f", find_o advance goto start find_o: save exactf "o", find_o find_bar: exactf "bar", done restore goto find_bar done: end
Here, we've added find_o as the second argument to exactf "o", making it execute itself as many times as it can (greedy). We've also added two important instructions (which aren't all that important, given the text of the regex), save and restore. These basically make a push-pop method for recording and going back to a location in the string.

Notice this code shows absolutely NOTHING in the way of optimizations. Here's the optimized code for /fo*?bar/:

start: exactf "f", find_o advance goto start find_o: exactf "o", find_o exactf "bar", done goto start done: end
And here's the (surprisingly similar!) optimized code for /fo*bar/:
start: exactf "f", find_o advance goto start find_o: exactf "o", find_o exactf "bar", done goto start done: end
They're the same because the regexes are one in the same in this specific example. More complex regexes will behave differently. This regex just happens to be wonderfully implemented like a DFA, really...

Well, I hope this has been a good example.

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Replies are listed 'Best First'.
Re: Re: Flattening REs into opcodes for Perl 6
by BrentDax (Hermit) on Nov 03, 2001 at 03:45 UTC
    First of all, thanks for replying.

    start: exactf "f", find_o advance goto start find_o: exactf "o" exactf "bar", done goto find_o done: end
    How does that deal with being matched against, say, "afafoobarz"? Perhaps:
    start: save exactf "f", find_o restore #to keep the balance advance goto start find_o: exactf "o", find_bar restore advance goto start find_bar: exactf "bar", done goto find_o ...
    would be more correct? Actually, come to think of it, I don't see how it deals with a case like "babylon" (it doesn't match at all) either--maybe advance should have a parameter that it jumps to if it fails?

    =cut
    --Brent Dax
    There is no sig.

      Ooh, right, thanks for pointing that out. Let's see...
      start: exactf "f", find_o advance quit goto start find_o: # non-greedy, by the way goto find_bar find_o_real: exactf "o", find_o goto start find_bar: exactf "bar", done goto find_o_real done: end quit: fail
      I should point out that I haven't optimized find_o out for readability purposes, and that in "real life", this would look like:
      start: exactf "f", find_bar advance quit goto start find_o_real: exactf "o", find_bar goto start find_bar: exactf "bar", done goto find_o_real done: end quit: fail
      This feels like drawing a DFA... honestly. I'm not sure where save and restore are going to come into play in this particular example -- it seems not at all.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        Would this work? I don't have the ops integrated into Parrot's normal set yet, but I'd like to know if I'm on the right track. re_match is like calling RE_0 as a subroutine, and re_finished is like returning from it. The set is an assignment (in this case to Integer register 0); pretty much everything else that may look odd in here has been explained inline.
        re_match RE_0, "afoobarz" RE_0: # /fo*?bar/s (and yes, the "s" flag is pointless) re_flags "s" re_minlength 4 $start: #a 'local' label--it disappears the next time a normal # label is seen re_pushindex #in case we fail in $find_o or something re_literal "f", $find_bar re_popindex re_advance $failure branch $start #think "goto" $find_o: re_literal "o", $find_bar branch $start $find_bar: re_literal "bar", $success branch $find_o $success: set I0, 1 branch $end $failure: set I0, 0 $end: re_finished
        Looking at this, it seems that most of the time we use the second parameter we're trying to avoid a branch put in in case we fail; is this just a quirk of this example, or might it make more sense for the second parameter to represent where to jump if we fail?
        $start: re_pushindex re_literal "f", $advance $find_bar: re_literal "bar", $find_o #if we made it this far, we're done! set I0, 1 branch $end $find_o: re_literal "o", $advance branch $find_bar $advance: re_popindex re_advance $failure branch $start $failure: set I0, 0 $end: re_finished
        This would be a bigger win in the case of, say, several character classes in a row:
        RE_1: #/[az][by][cx]/ re_flags "" re_minlength 3 $start: re_pushindex re_oneof "az", $advance re_oneof "by", $advance re_oneof "cx", $advance set I0, 1 branch $end $advance: re_popindex re_advance $fail branch $start $fail: set I0, 0 $end: re_finished

        =cut
        --Brent Dax
        There is no sig.