
Then it would be plenty safe to take a regexp from the user and use it to run a search. But with the current Perl regexes, no way.
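To make the risk concrete, here is a minimal sketch of a backtracking matcher for the pattern family a?{n}a{n} from the article, run against a string of n a's. It is not Perl's engine; the function name `backtrack_steps` and the token-list representation are assumptions made up for illustration. It only counts how many positions a naive backtracking search visits.

```python
def backtrack_steps(n, text):
    """Match (a?){n}(a){n} against text by naive backtracking.

    Returns (matched, steps), where steps counts recursive calls.
    Illustrative only: real engines add optimizations this sketch lacks.
    """
    pattern = ["a?"] * n + ["a"] * n
    steps = 0

    def go(pi, ti):
        nonlocal steps
        steps += 1
        if pi == len(pattern):
            return ti == len(text)          # anchored at both ends
        has_a = ti < len(text) and text[ti] == "a"
        if pattern[pi] == "a?":
            # Greedy: try consuming an 'a' first, then try skipping it.
            return (has_a and go(pi + 1, ti + 1)) or go(pi + 1, ti)
        return has_a and go(pi + 1, ti + 1)

    return go(0, 0), steps


if __name__ == "__main__":
    for n in (5, 10, 15):
        print(n, backtrack_steps(n, "a" * n))
```

The only way to match is to skip every optional a, which is the last choice the greedy search tries, so the step count more than doubles each time n grows by one. A user who can supply the pattern can pick n.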

Doesn't this criticism apply equally to Thompson's algorithm or to DFA construction? I would have thought the only difference would be that in a DFA you'd see performance issues with compilation and not execution.

It depends on the implementation. If you use Thompson's algorithm as implemented in nfa.c from the article, then you end up with a run-time that is O(m*n) for length-m regexp and length-n text.
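As a rough illustration of where the O(m*n) bound comes from, here is a sketch of state-set simulation in the style of Thompson's algorithm. It is not a port of nfa.c: the NFA is hand-built for the single pattern family a?{n}a{n}, and the names `build_nfa`, `closure`, and `nfa_match` are invented for this example.

```python
def build_nfa(n):
    """Hand-built NFA for (a?){n}(a){n}, anchored at both ends.

    States are 0..2n and state 2n accepts. This layout is an
    assumption for illustration, not the representation nfa.c uses.
    """
    char_edges = {i: i + 1 for i in range(2 * n)}  # consume one 'a'
    eps_edges = {i: i + 1 for i in range(n)}       # skip an optional 'a'
    return char_edges, eps_edges, 2 * n


def closure(states, eps_edges):
    """Add every state reachable through epsilon edges."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        t = eps_edges.get(s)
        if t is not None and t not in seen:
            seen.add(t)
            stack.append(t)
    return seen


def nfa_match(n, text):
    """Return (matched, steps); steps counts NFA states visited."""
    char_edges, eps_edges, accept = build_nfa(n)
    current = closure({0}, eps_edges)
    steps = 0
    for ch in text:
        nxt = set()
        for s in current:
            steps += 1
            if ch == "a" and s in char_edges:
                nxt.add(char_edges[s])
        current = closure(nxt, eps_edges)
    return accept in current, steps
```

The current set never holds more than 2n+1 states, so the visit count is at most (2n+1) * len(text): each input character costs work proportional to the size of the regexp, and nothing is ever revisited.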

It is true that if you pre-compile the DFA, then you end up with O(2^m * n) time, but you can build the DFA on the fly and you're back to O(m*n), just with a smaller constant than the nfa.c version.
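Building the DFA on the fly can be sketched as follows, again for the hand-built a?{n}a{n} NFA rather than a general regexp. Each DFA state is a frozen set of NFA states, and a transition is computed only the first time the scan needs it. The class name `LazyDFA` and its layout are assumptions for illustration, not the article's code.

```python
class LazyDFA:
    """Lazily built DFA for (a?){n}(a){n}, anchored at both ends.

    NFA states are 0..2n; states below n have an epsilon edge to the
    next state (the optional 'a'), and state 2n accepts.
    """

    def __init__(self, n):
        self.n = n
        self.accept = 2 * n
        self.cache = {}                      # (dfa_state, char) -> dfa_state
        self.start = self._closure({0})

    def _closure(self, states):
        out = set(states)
        for s in states:
            if s < self.n:
                # Skipping optional a's reaches every state up to n.
                out.update(range(s, self.n + 1))
        return frozenset(out)

    def _step(self, state, ch):
        key = (state, ch)
        if key not in self.cache:            # first visit: do the NFA work once
            moved = {s + 1 for s in state if ch == "a" and s < self.accept}
            self.cache[key] = self._closure(moved)
        return self.cache[key]               # later visits: one dict lookup

    def match(self, text):
        state = self.start
        for ch in text:
            state = self._step(state, ch)
        return self.accept in state
```

A scan adds at most one cache entry per input character, so the number of DFA states built is bounded by the text length rather than by 2^m, and repeated scans reuse the transitions already computed.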

And just to be clear, that is no more incompatible with whatever optimizations you might add (like exact string search) than backtracking is.

Sure. But the question is will Construction time + FBM + Verification be faster for a BNFA (backtracking NFA) than for a DFA? And will the DFA consume radically more memory than the BNFA? And my position is that most likely the DFA will win only on degenerate patterns. The rest of the time my feeling is that the BNFA will win hands down, mostly because of how cheap construction is.

This is a myth. Like most myths it is based in a truth that is no longer true. When people precomputed the entire DFA before starting the search, like the original egrep did, construction was in fact pretty expensive. But no one does that anymore. They build out the DFA on the fly, as described in the article. If you build out the DFA on the fly, then the "compilation" or "construction" passes are more or less the same as for backtracking: basically you just have to syntax check the regexp and build some internal representation of a parse tree. If you look at the big graph near the end of the article, you'll see that the nfa.c in the article has cheaper construction cost than all of the fancier implementations, by about a factor of 4. And I wasn't really trying.

Construction costs are what you engineer them to be -- neither approach has any advantage here.

In reply to Re^3: Perl regexp matching is slow?? by rsc
in thread Perl regexp matching is slow?? by smahesh
