in reply to Re^5: Perl regexp matching is slow??
in thread Perl regexp matching is slow??

Of course, I suppose that it's using a fairly small (a?a) regex, so it's not as illustrative as I had hoped.

Well the scaling issue I was thinking of would be for /a{10000}/ and things like it. You said its not necessary to unroll such a construct but im not really seeing how.

I'd be perfectly happy to rerelease the code under an appropriate license, but I doubt you'd want it

Thanks thats really great. Ive already started work on beefing it up some to handle the Perl syntax and Ive completed some changes that turn it into a floating matcher, which I include below.

Regarding whether we want it, of course we do. :-) For our purposes of trying to shoehorn this stuff into the existing engine its easier to start with a minimal proof of concept than a fullblown engine. Somebody who wants that should do it as a plug in engine replacement. Me, im more concerned with making the out the package engine better and therefore a less polished and simpler framework is in many respects easier to deal with.

Id really like to be able to swap it in on the fly. Alhough TimToady pointed something out that gives me some pause, which is that normal Perl alternation matches leftmost-longest. Which effectively means that without extra syntax (which itself would mean changes to the normal engine as well) we can't swap it in when there is alternation in place. And to make the backtracing engine do longest-token would mean trying every alternation, so adding syntax like (?|...|...) which would be longest token alternation, would mean that if thompsons algorithm was /not/ swapped in the resulting regex would be slower. (Of course if we could detect that the DFA and NFA would match different things then this wouldnt be an issue, so maybe thats the right approach.)

Anyway, im still thinking of how to tackle Perls mixed utf8/byte-sequence encoding in a reasonably way. Probably by converting high bit non utf8 to utf8 and then treating the ut8 as a byte stream. This would also facilitate case-insensitive matches. And BTW unless I have misread the Plan-9 engine didnt seem to have very comprehensive unicode matching support, I wouldnt use it as a basis for discussing unicode regex matching, as it doesnt seem to support any of the hairy cases. Unicode matching is /not/ as simple as saying "ok we have a huge character set now", it also means dealing with combining characters and rules like those required for characters like german-sharp-ess, or greek-sigma.

Anyway, here is a modified match routine for dfa1.c that does floating matches. Note this requires extending the Dstate struct to know that it has accepted.

/* Run DFA to determine whether there is a matching string inside of s +. */ int match(DState *start, char *s) { DState *d, *next; int c; char *realstart = s; char *spos = s; char *epos = NULL; d = start; for(; *s; s++){ c = *s & 0xFF; if( !d->l.n ) { /* what we have seen up to this point resulted in a rejecting match. There are no more states to transition to */ if (!epos) { /* and we never saw an accepting state on our way there, so reset the atch start position and restart with the initial state +*/ spos = s; d = start; } else { /* on our way we encountered an accepting stat +e, therefore we are done now */ break; } } else if (d->accept) { /* we are in an accepting state, advance the end point of the match, and then continue to see if there is a longer match */ epos = s; } if((next = d->next[c]) == NULL) next = nextstate(d, c); d = next; } if (ismatch(&d->l)) { epos = s; } if (epos) { printf("Start: %d End: %d | %.*s\n", spos-realstart, epos-realstart, epos-spos, spos); return 1; } return 0; }

Oh, and a little nit, you can't portably use 'char' to index an array of 256 elements. On some platforms char is signed, you need to explicitly use unsigned char when doing such an indexing.

---
$world=~s/war/peace/g

Replies are listed 'Best First'.
Re^7: Perl regexp matching is slow??
by rsc (Acolyte) on Feb 04, 2007 at 20:24 UTC
    I changed the files posted on my web site to allow redistribution under the MIT license. Feel free to move the terms back to the top of the file if you prefer; I like being able to see the source code when I open a file instead of a license, but I know I'm in the minority.

    Well the scaling issue I was thinking of would be for /a{10000}/ and things like it. You said its not necessary to unroll such a construct but im not really seeing how.

    I don't remember saying that. If I did, I misspoke. In general, you can't do any better than to unroll the repetition: a{10000} is going to result in a 10000-state NFA.

    Anyway, im still thinking of how to tackle Perls mixed utf8/byte-sequence encoding in a reasonably way. Probably by converting high bit non utf8 to utf8 and then treating the ut8 as a byte stream. This would also facilitate case-insensitive matches. And BTW unless I have misread the Plan-9 engine didnt seem to have very comprehensive unicode matching support, I wouldnt use it as a basis for discussing unicode regex matching, as it doesnt seem to support any of the hairy cases. Unicode matching is /not/ as simple as saying "ok we have a huge character set now", it also means dealing with combining characters and rules like those required for characters like german-sharp-ess, or greek-sigma.

    I mentioned the Plan 9 regexp library in connection with Unicode mainly because it was the first to support Unicode and UTF-8, and because it does so fairly simply. True, it does not implement things like german-sharp-ess rules. You could easily change the loop in regexec.c to take the next UTF-8 sequence if one is there or else take a single byte, if that's what Perl does. I don't know what Perl's semantics here are.

    Anyway, here is a modified match routine for dfa1.c that does floating matches.

    It is not possible to find the boundaries of an unanchored match in a single pass with a fixed amount of storage (like, say, a single start pointer and a state pointer). If the underlying NFA has m states, then in general you need to track m possible start pointers. If the code you posted searches for aaab in the text aaaab, it will not find a match: it will read the aaaa and then start over at b.

    Oh, and a little nit, you can't portably use 'char' to index an array of 256 elements. On some platforms char is signed, you need to explicitly use unsigned char when doing such an indexing.

    True, but you can portably index into it if you use char & 0xFF, which is what the code does.

      I don't remember saying that. If I did, I misspoke. In general, you can't do any better than to unroll the repetition: a{10000} is going to result in a 10000-state NFA.

      I guess i inferred it from the discussion of construction costs earlier where you said that modern techniques avoid large construction costs.

      Unicode and UTF-8, and because it does so fairly simply

      Simply, unicode and regexes dont fit too well together. :-)

      It is not possible to find the boundaries of an unanchored match in a single pass with a fixed amount of storage (like, say, a single start pointer and a state pointer).

      Oh f**k. Thanks for straightening me out there. That makes things quite a bit trickier doesnt it?

      True, but you can portably index into it if you use char & 0xFF, which is what the code does.

      Ah right. Again thanks. I wasnt trying to be an arse by mentioning that btw, (although i ended up looking like one), its something that has come up quite a few times in the perl engine, so i thought it was just another example of an easy error to make. Should have expected someone as knowledgable as yourself to have considered it already. Sorry.

      ---
      $world=~s/war/peace/g