in reply to Re^5: Perl regexp matching is slow??
in thread Perl regexp matching is slow??
Of course, I suppose that it's using a fairly small (a?a) regex, so it's not as illustrative as I had hoped.
Well, the scaling issue I was thinking of would be for /a{10000}/ and constructs like it. You said it's not necessary to unroll such a construct, but I'm not really seeing how.
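For concreteness, here is a minimal sketch of the counter-based alternative to unrolling, as I understand it: a fixed repetition like /a{10000}/ can be tracked with a single state plus an integer counter rather than 10000 copies of the state. The function `match_a_n` and its signature are my own illustration (anchored at the start of the string), not anything from dfa1.c:

```c
/* Sketch: match /a{n}/ anchored at the start of s, without unrolling
 * the quantifier into n separate states. One state plus a counter is
 * enough for a fixed repetition: bump the counter on each 'a', and
 * accept exactly when it reaches n. */
int
match_a_n(const char *s, int n)
{
    int count = 0;

    while(*s == 'a' && count < n){
        s++;
        count++;
    }
    return count == n;
}
```

The same trick generalizes to {m,n} by accepting anywhere in the counter range [m,n], which is presumably what's meant by not needing to unroll.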
I'd be perfectly happy to rerelease the code under an appropriate license, but I doubt you'd want it
Thanks, that's really great. I've already started work on beefing it up some to handle the Perl syntax, and I've completed some changes that turn it into a floating matcher, which I include below.
Regarding whether we want it: of course we do. :-) For our purposes of trying to shoehorn this stuff into the existing engine, it's easier to start with a minimal proof of concept than a full-blown engine. Somebody who wants that should do it as a plug-in engine replacement. Me, I'm more concerned with making the out-of-the-box engine better, and therefore a less polished and simpler framework is in many respects easier to deal with.
I'd really like to be able to swap it in on the fly. Although TimToady pointed something out that gives me some pause, which is that normal Perl alternation matches leftmost-first, not longest-token, while a DFA matches leftmost-longest. Which effectively means that without extra syntax (which itself would mean changes to the normal engine as well) we can't swap it in when there is alternation in place. And to make the backtracking engine do longest-token would mean trying every alternative, so adding syntax like (?|...|...) for longest-token alternation would mean that if Thompson's algorithm was /not/ swapped in, the resulting regex would be slower. (Of course, if we could detect when the DFA and NFA would match different things then this wouldn't be an issue, so maybe that's the right approach.)

Anyway, I'm still thinking of how to tackle Perl's mixed utf8/byte-sequence encoding in a reasonable way. Probably by converting high-bit non-utf8 to utf8 and then treating the utf8 as a byte stream. This would also facilitate case-insensitive matches.

And BTW, unless I have misread it, the Plan 9 engine didn't seem to have very comprehensive Unicode matching support. I wouldn't use it as a basis for discussing Unicode regex matching, as it doesn't seem to support any of the hairy cases. Unicode matching is /not/ as simple as saying "ok, we have a huge character set now"; it also means dealing with combining characters and rules like those required for characters like the German sharp-s or the Greek sigma.
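To make the encoding idea concrete, here is a sketch of the normalization step I have in mind, assuming the non-utf8 high-bit bytes are Latin-1 (the helper `latin1_to_utf8` is hypothetical, not part of dfa1.c): each byte is re-encoded as utf8 so the DFA can treat all input uniformly as a byte stream.

```c
/* Sketch: re-encode one (assumed Latin-1) byte as UTF-8 so the DFA
 * can treat all input uniformly as a byte stream. Writes 1 or 2
 * bytes into out and returns the number written. Latin-1 code
 * points are all below U+0100, so two bytes always suffice. */
int
latin1_to_utf8(unsigned char c, unsigned char out[2])
{
    if(c < 0x80){
        out[0] = c;                 /* ASCII passes through unchanged */
        return 1;
    }
    out[0] = 0xC0 | (c >> 6);       /* leading byte: top two bits */
    out[1] = 0x80 | (c & 0x3F);     /* continuation byte: low six bits */
    return 2;
}
```

Compiling the regex's own character classes against the same utf8 byte sequences would then keep the DFA transition tables purely byte-indexed.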
Anyway, here is a modified match routine for dfa1.c that does floating matches. Note this requires extending the DState struct so that it knows whether it has accepted.
    /* Run DFA to determine whether there is a matching string inside of s. */
    int
    match(DState *start, char *s)
    {
        DState *d, *next;
        int c;
        char *realstart = s;
        char *spos = s;
        char *epos = NULL;

        d = start;
        for(; *s; s++){
            c = *s & 0xFF;
            if(!d->l.n){
                /* What we have seen up to this point resulted in a
                 * rejecting match; there are no more states to
                 * transition to. */
                if(!epos){
                    /* We never saw an accepting state on our way here,
                     * so reset the match start position and restart
                     * with the initial state. */
                    spos = s;
                    d = start;
                } else {
                    /* On our way we encountered an accepting state,
                     * therefore we are done now. */
                    break;
                }
            } else if(d->accept){
                /* We are in an accepting state: advance the end point
                 * of the match, then continue to see if there is a
                 * longer match. */
                epos = s;
            }
            if((next = d->next[c]) == NULL)
                next = nextstate(d, c);
            d = next;
        }
        if(ismatch(&d->l))
            epos = s;
        if(epos){
            printf("Start: %d End: %d | %.*s\n",
                (int)(spos - realstart), (int)(epos - realstart),
                (int)(epos - spos), spos);
            return 1;
        }
        return 0;
    }
Oh, and a little nit: you can't portably use 'char' to index an array of 256 elements. On some platforms char is signed, so you need to explicitly go through unsigned char when doing such an indexing.
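To illustrate the nit (the helper `byte_index` is my own example, not from dfa1.c): on a signed-char platform, a high-bit byte stored in a plain char is negative, so using it directly as a table index reads out of bounds. Casting through unsigned char guarantees a value in 0..255.

```c
/* Illustrates the signed-char indexing pitfall: on platforms where
 * plain char is signed, a high-bit byte such as 0xFF has the value
 * -1, so table[c] would index before the start of a 256-entry table.
 * Casting through unsigned char always yields a value in 0..255. */
int
byte_index(char c)
{
    return (unsigned char)c;   /* safe index into a 256-entry table */
}
```

This is also why the loop above masks with `*s & 0xFF` before indexing `d->next[c]`.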
Replies are listed 'Best First'.
Re^7: Perl regexp matching is slow??
by rsc (Acolyte) on Feb 04, 2007 at 20:24 UTC
by demerphq (Chancellor) on Feb 04, 2007 at 23:28 UTC