in reply to Re^6: Perl regexp matching is slow??
in thread Perl regexp matching is slow??

I changed the files posted on my web site to allow redistribution under the MIT license. Feel free to move the terms back to the top of the file if you prefer; I like being able to see the source code when I open a file instead of a license, but I know I'm in the minority.

Well, the scaling issue I was thinking of would be for /a{10000}/ and things like it. You said it's not necessary to unroll such a construct, but I'm not really seeing how.

I don't remember saying that. If I did, I misspoke. In general, you can't do any better than to unroll the repetition: a{10000} is going to result in a 10000-state NFA.
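
For illustration, here is a toy sketch of why the repetition gets unrolled; it is not the dfa1.c code, and the State struct and function names here are made up. A Thompson-style construction of a{n} chains n copies of the single-character fragment, so a{10000} really does cost 10000 states:

    /* Sketch only: unrolling a{n} into one NFA state per repetition. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct State State;
    struct State {
        int c;           /* character to match, or -1 for the accepting state */
        State *next;     /* state reached after matching c */
    };

    static State *
    state(int c, State *next)
    {
        State *s = malloc(sizeof *s);
        s->c = c;
        s->next = next;
        return s;
    }

    /* Compile c{n} by chaining n single-character states in front of
     * the accepting state: one state per repetition. */
    static State *
    repeat(int c, int n, State *accept)
    {
        State *s = accept;
        for (int i = 0; i < n; i++)
            s = state(c, s);
        return s;
    }

    /* Anchored match of the chain against s. */
    static int
    match(State *start, const char *s)
    {
        State *st;
        for (st = start; st->c != -1; s++, st = st->next)
            if (*s != st->c)
                return 0;
        return 1;
    }

    int
    main(void)
    {
        State accept = { -1, NULL };
        State *re = repeat('a', 10000, &accept);   /* 10000 states */

        char *text = malloc(10001);
        for (int i = 0; i < 10000; i++)
            text[i] = 'a';
        text[10000] = '\0';

        printf("%d\n", match(re, text));           /* prints 1 */
        return 0;
    }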

Anyway, I'm still thinking about how to tackle Perl's mixed UTF-8/byte-sequence encoding in a reasonable way. Probably by converting high-bit non-UTF-8 to UTF-8 and then treating the UTF-8 as a byte stream. This would also facilitate case-insensitive matches. And BTW, unless I have misread it, the Plan 9 engine didn't seem to have very comprehensive Unicode matching support; I wouldn't use it as a basis for discussing Unicode regex matching, as it doesn't seem to support any of the hairy cases. Unicode matching is /not/ as simple as saying "ok, we have a huge character set now"; it also means dealing with combining characters and rules like those required for characters like the German sharp-s or Greek sigma.

I mentioned the Plan 9 regexp library in connection with Unicode mainly because it was the first to support Unicode and UTF-8, and because it does so fairly simply. True, it does not implement things like the German sharp-s rules. You could easily change the loop in regexec.c to take the next UTF-8 sequence if one is there, or else take a single byte, if that's what Perl does. I don't know what Perl's semantics are here.
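
For illustration only, here is roughly what that loop change could look like; it is a standalone sketch, not the regexec.c source, and nextrune is a made-up helper. It decodes a well-formed UTF-8 sequence when one is present and otherwise consumes a single raw byte:

    /* Sketch: take the next UTF-8 sequence if one is there, else one byte. */
    #include <stdio.h>

    /* Decode one code point starting at s, store it in *rune, and return
     * the number of bytes consumed (always at least 1). */
    static int
    nextrune(const unsigned char *s, unsigned int *rune)
    {
        if (s[0] < 0x80) {                                     /* ASCII */
            *rune = s[0];
            return 1;
        }
        if ((s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80) {  /* 2-byte sequence */
            *rune = ((s[0] & 0x1F) << 6) | (s[1] & 0x3F);
            return 2;
        }
        if ((s[0] & 0xF0) == 0xE0 && (s[1] & 0xC0) == 0x80
                                  && (s[2] & 0xC0) == 0x80) {  /* 3-byte sequence */
            *rune = ((s[0] & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F);
            return 3;
        }
        if ((s[0] & 0xF8) == 0xF0 && (s[1] & 0xC0) == 0x80
                && (s[2] & 0xC0) == 0x80 && (s[3] & 0xC0) == 0x80) { /* 4-byte */
            *rune = ((s[0] & 0x07) << 18) | ((s[1] & 0x3F) << 12)
                  | ((s[2] & 0x3F) << 6) | (s[3] & 0x3F);
            return 4;
        }
        *rune = s[0];                          /* malformed: take one raw byte */
        return 1;
    }

    int
    main(void)
    {
        /* "aé" in UTF-8 followed by a stray 0xFF byte */
        const unsigned char text[] = { 'a', 0xC3, 0xA9, 0xFF, 0 };
        const unsigned char *p = text;
        unsigned int rune;

        while (*p) {
            p += nextrune(p, &rune);
            printf("U+%04X\n", rune);          /* U+0061, U+00E9, U+00FF */
        }
        return 0;
    }

A matcher would then feed each decoded rune to the state machine instead of each raw byte.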

Anyway, here is a modified match routine for dfa1.c that does floating matches.

It is not possible to find the boundaries of an unanchored match in a single pass with a fixed amount of storage (like, say, a single start pointer and a state pointer). If the underlying NFA has m states, then in general you need to track m possible start pointers. If the code you posted searches for aaab in the text aaaab, it will not find a match: it will read the aaaa and then start over at b.
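
To sketch the usual fix (purely illustrative, not the dfa1.c code): keep one candidate start pointer per NFA state. For a literal pattern the states are just offsets into the pattern, so the bookkeeping is a small array:

    /* Sketch: unanchored leftmost search that tracks one start pointer
     * per partial match, rather than a single global start pointer.
     * Assumes a non-empty pattern shorter than 64 characters. */
    #include <stdio.h>
    #include <string.h>

    /* Return a pointer to the leftmost match of pat in text, or NULL.
     * active[i] != NULL means "a partial match that has consumed the
     * first i characters of pat began at active[i]". */
    static const char *
    search(const char *pat, const char *text)
    {
        size_t m = strlen(pat);
        const char *active[64] = { 0 };
        const char *p;

        for (p = text; ; p++) {
            active[0] = p;                     /* a new match may start here */
            /* advance every live partial match on the current character */
            for (size_t i = m; i > 0; i--) {
                if (active[i-1] != NULL && *p == pat[i-1]) {
                    if (i == m)
                        return active[i-1];    /* leftmost full match */
                    active[i] = active[i-1];
                } else {
                    active[i] = NULL;
                }
            }
            if (*p == '\0')
                return NULL;
        }
    }

    int
    main(void)
    {
        const char *text = "aaaab";
        const char *hit = search("aaab", text);

        if (hit)
            printf("match at offset %td\n", hit - text);   /* offset 1 */
        else
            printf("no match\n");
        return 0;
    }

Run on aaab against aaaab, it reports the match starting at offset 1, which a matcher that keeps only a single start pointer and restarts on failure would walk right past.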

Oh, and a little nit: you can't portably use 'char' to index an array of 256 elements. On some platforms char is signed, so you need to explicitly use unsigned char when doing such indexing.

True, but you can portably index into it if you use char & 0xFF, which is what the code does.
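
A tiny illustration (counts and tally are made-up names, not from the posted code) showing that the masked index stays in 0..255 even where plain char is signed:

    #include <stdio.h>

    static int counts[256];

    static void
    tally(const char *s)
    {
        for (; *s; s++)
            counts[*s & 0xFF]++;        /* same as counts[(unsigned char)*s]++ */
    }

    int
    main(void)
    {
        tally("caf\xc3\xa9");           /* bytes above 0x7F index correctly */
        printf("%d\n", counts[0xC3]);   /* prints 1 */
        return 0;
    }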

Re^8: Perl regexp matching is slow??
by demerphq (Chancellor) on Feb 04, 2007 at 23:28 UTC

    I don't remember saying that. If I did, I misspoke. In general, you can't do any better than to unroll the repetition: a{10000} is going to result in a 10000-state NFA.

    I guess I inferred it from the discussion of construction costs earlier, where you said that modern techniques avoid large construction costs.

    Unicode and UTF-8, and because it does so fairly simply

    Put simply, Unicode and regexes don't fit too well together. :-)

    It is not possible to find the boundaries of an unanchored match in a single pass with a fixed amount of storage (like, say, a single start pointer and a state pointer).

    Oh f**k. Thanks for straightening me out there. That makes things quite a bit trickier, doesn't it?

    True, but you can portably index into it if you use char & 0xFF, which is what the code does.

    Ah, right. Again, thanks. I wasn't trying to be an arse by mentioning that, BTW (although I ended up looking like one); it's something that has come up quite a few times in the Perl engine, so I thought it was just another example of an easy error to make. I should have expected someone as knowledgeable as yourself to have considered it already. Sorry.

    ---
    $world=~s/war/peace/g