I changed the files posted on my web site to allow redistribution under the MIT license. Feel free to move the terms back to the top of the file if you prefer; I like being able to see the source code when I open a file instead of a license, but I know I'm in the minority.

Well the scaling issue I was thinking of would be for /a{10000}/ and things like it. You said its not necessary to unroll such a construct but im not really seeing how.

I don't remember saying that. If I did, I misspoke. In general, you can't do any better than to unroll the repetition: a{10000} is going to result in a 10000-state NFA.

Anyway, im still thinking of how to tackle Perls mixed utf8/byte-sequence encoding in a reasonably way. Probably by converting high bit non utf8 to utf8 and then treating the ut8 as a byte stream. This would also facilitate case-insensitive matches. And BTW unless I have misread the Plan-9 engine didnt seem to have very comprehensive unicode matching support, I wouldnt use it as a basis for discussing unicode regex matching, as it doesnt seem to support any of the hairy cases. Unicode matching is /not/ as simple as saying "ok we have a huge character set now", it also means dealing with combining characters and rules like those required for characters like german-sharp-ess, or greek-sigma.

I mentioned the Plan 9 regexp library in connection with Unicode mainly because it was the first to support Unicode and UTF-8, and because it does so fairly simply. True, it does not implement things like german-sharp-ess rules. You could easily change the loop in regexec.c to take the next UTF-8 sequence if one is there or else take a single byte, if that's what Perl does. I don't know what Perl's semantics here are.

Anyway, here is a modified match routine for dfa1.c that does floating matches.

It is not possible to find the boundaries of an unanchored match in a single pass with a fixed amount of storage (like, say, a single start pointer and a state pointer). If the underlying NFA has m states, then in general you need to track m possible start pointers. If the code you posted searches for aaab in the text aaaab, it will not find a match: it will read the aaaa and then start over at b.

Oh, and a little nit, you can't portably use 'char' to index an array of 256 elements. On some platforms char is signed, you need to explicitly use unsigned char when doing such an indexing.

True, but you can portably index into it if you use char & 0xFF, which is what the code does.

In reply to Re^7: Perl regexp matching is slow?? by rsc
in thread Perl regexp matching is slow?? by smahesh

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.