in reply to Re: Re: Flattening REs into opcodes for Perl 6
in thread Flattening REs into opcodes for Perl 6

Ooh, right, thanks for pointing that out. Let's see...
start: exactf "f", find_o advance quit goto start find_o: # non-greedy, by the way goto find_bar find_o_real: exactf "o", find_o goto start find_bar: exactf "bar", done goto find_o_real done: end quit: fail
I should point out that I haven't optimized find_o out for readability purposes, and that in "real life", this would look like:
start: exactf "f", find_bar advance quit goto start find_o_real: exactf "o", find_bar goto start find_bar: exactf "bar", done goto find_o_real done: end quit: fail
This feels like drawing a DFA... honestly. I'm not sure where save and restore are going to come into play in this particular example -- it seems not at all.

_____________________________________________________
Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

Replies are listed 'Best First'.
Re: Re: Re: Re: Flattening REs into opcodes for Perl 6
by BrentDax (Hermit) on Nov 03, 2001 at 09:41 UTC
    Would this work? I don't have the ops integrated into Parrot's normal set yet, but I'd like to know if I'm on the right track. re_match is like calling RE_0 as a subroutine, and re_finished is like returning from it. The set is an assignment (in this case to Integer register 0); pretty much everything else that may look odd in here has been explained inline.
    re_match RE_0, "afoobarz" RE_0: # /fo*?bar/s (and yes, the "s" flag is pointless) re_flags "s" re_minlength 4 $start: #a 'local' label--it disappears the next time a normal # label is seen re_pushindex #in case we fail in $find_o or something re_literal "f", $find_bar re_popindex re_advance $failure branch $start #think "goto" $find_o: re_literal "o", $find_bar branch $start $find_bar: re_literal "bar", $success branch $find_o $success: set I0, 1 branch $end $failure: set I0, 0 $end: re_finished
    Looking at this, it seems that most of the time we use the second parameter we're trying to avoid a branch put in in case we fail; is this just a quirk of this example, or might it make more sense for the second parameter to represent where to jump if we fail?
    $start: re_pushindex re_literal "f", $advance $find_bar: re_literal "bar", $find_o #if we made it this far, we're done! set I0, 1 branch $end $find_o: re_literal "o", $advance branch $find_bar $advance: re_popindex re_advance $failure branch $start $failure: set I0, 0 $end: re_finished
    This would be a bigger win in the case of, say, several character classes in a row:
    RE_1: #/[az][by][cx]/ re_flags "" re_minlength 3 $start: re_pushindex re_oneof "az", $advance re_oneof "by", $advance re_oneof "cx", $advance set I0, 1 branch $end $advance: re_popindex re_advance $fail branch $start $fail: set I0, 0 $end: re_finished

    =cut
    --Brent Dax
    There is no sig.

      (The previous content was erroneous.)

      It looks good; I'm tired, though. I'm going to sleep, and I'm going to be programming C all day (well, until 3:00 pm) tomorrow, so I'll get back to you.

      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker.
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;

        Sorry, one more thing. :^)

        Angel Faus and I were discussing the best way to handle multiple backtracking thingies in the same regexp without one backtracking thingie trying to use another's index. I suggested pushing a mark (a special value, probably -1) onto the stack; if rePopindex popped a mark, it will have 'failed', and thus it would jump to the address given as its parameter. That would work something like this:

        RE: #set up flags and stuff branch $start $advance: rePopindex reAdvance $fail $start: reLiteral "f", $advance rePushindex rePushmark $findo: rePushindex reLiteral "o", $findbar branch $findo $backo: rePopindex $advance $findbar: reLiteral "o", $backo set I0, 1 reFinished $fail: set I0, 0 reFinished
        Angel suggested that when we push an address onto the stack, we give it a label as a parameter; if a matching operation fails, it pops an index/label pair off the stack, sets the index and jumps to the label. Something like this:
        RE: #set up flags and stuff reOnfail $fail #if there's nothing left on the stack branch $start $advance: reAdvance $start: rePushindex $advance reLiteral "f" $findo: rePushindex $findbar reLiteral "o" branch $findo $findbar: reLiteral "bar" set I0, 1 reFinished $fail: set I0, 0 reFinished
        Can you give me any guidance about which will work better in the long run? Angel's idea has ease-of-code-generation and code compactness going for it; mine has explicitness (for debugging) and flexibility. From what I can tell, alternation will be a bitch with either, but mine should be easier; same with optional constructs and lookaheads, from what I can see. (Of course, I could just be blinded because one is my idea and the other is someone else's; that's why I'm asking you.)

        Thank you for helping me out with this. Discussing it with an expert has made this a hundred times better than it was.

        =cut
        --Brent Dax
        There is no sig.