comment on

I wrote the article. Whether you think it applies to Perl depends on some usually-unstated beliefs about regular expressions and how they should be used.

Regular expressions started out as a notation for describing the strings that they match: a(bb)*a says "first an a, then some number of bb, then an a". They didn't say anything about how you had to take an input string and check whether it matched, and in fact there were multiple algorithms used for doing so. As a computational measure, regular expressions describe sets of strings that can be matched in linear time with constant space, using an algorithm that isn't obvious from looking at the regular expression notation.

Perl's regular expressions have come to be viewed as a notation for an algorithm that matches strings. Operators like backreferences started the trend long before Perl, but new operators like recursion and, to a lesser extend, lookahead and lookbehind assertions, certainly continued the trend quite a ways beyond where it had been. These features are most easily defined in terms of the underlying matching, not in terms of the strings they describe. And of course, as everyone loves to say, Perl's regular expressions are not regular expressions anymore. (While this is true, a large fraction of the ones actually written are true regular expressions and could be treated as such.)

The big difference between these two is whether you think regular expressions should be treated as descriptive (here's the effect I want) or prescriptive (here's how I want you to do it). People who accept the existence of so-called pathological regular expressions implicitly treat regular expressions as prescriptive: "we gave you a lot of rope, so of course you can manage to hang yourself." I believe that until one crosses into the not-too-often-visited land of exotic Perl regular expression features like generalized assertions, recursion, and backreferences, it is best to treat regular expressions as descriptive. Then the regular expression engine can implement the expression efficiently no matter what you've typed. Continuing the strained analogy from before, "you can have all the rope you need, but the regex engine is going to do the knot-tying for you once you tell it what you want to do, so there is no way you can hang yourself."

In the prescriptive regex mindset, programmers have to microoptimize their regular expressions, knowing how they get implemented in order to avoid so-called pathological behavior by avoiding potentially-expensive operations except when he is sure (and hopefully correct!) that "they won't be expensive in this particular case." In the descriptive regex mindset, there are no pathological regular expressions. If there is some required substring, then fine, do that as a further optimization. But the programmer can be guaranteed that running a regular expression will not be expensive, unless it uses advanced features like assertions, recursion, or backreferences. Right now * and ? are potentially-expensive features, and they needn't be.

If the Perl programmer is the one writing the regular expressions, then maybe there isn't much benefit to not having to learn how to "optimize" regular expressions. After all, one has already had to learn Perl; learning how to write regular expressions that Perl implements efficiently is not much more work. However, if the Perl programmer isn't the one writing the regular expressions, then there is a lot of benefit to knowing that a regular expression search isn't going to run for years and years. As an example, let's consider a CGI script that is going to accept regular expressions as search keys. The programmer needs to execute those searches, and it would be very useful if Perl could be relied upon to run them all efficiently, or at least to flag the ones that might not run efficiently because they have things like recursion. Perhaps there could be a regex flag that says "run this in linear time, and if you can't, don't run it at all". Then it would be plenty safe to take a regexp from the user and use it to run a search. But with the current Perl regexes, no way. Lest you think the CGI example is contrived, consider Jeffrey Friedl's Japanese-English dictionary, which accepts regular expressions as search terms. The regexp lookup is really slow already (maybe the host machines are overloaded) but I bet it would be very very slow if you gave it a regular expression like ((((.+)+)+)+)+(aaa|bbb). (Cutting off the search isn't quite correct here, since there are some entries that do match aaa, at least if you are searching for English.)

It's great that Perl 5.10 is going to have a hot-swappable regular expression engine, because then maybe someone could build one that handles the true-regular-expression notation in guaranteed linear time and then Perl programmers could ask for it if they wanted a guarantee of predictable behavior. Even better, that would pave a way to having Perl check for the non-regular operators, and if they weren't there, choose the linear-time implementation automatically.

In short, the article is not about making Perl's regex matching 100x faster across all programs. It is about making it more predictable and consistent across all programs. And just to be clear, that is no more incompatible with whatever optimizations you might add (like exact string search) than backtracking is.

In reply to Re: Perl regexp matching is slow?? by rsc
in thread Perl regexp matching is slow?? by smahesh

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.