PsychoSpunk has asked for the wisdom of the Perl Monks concerning the following question:

I was looking for insight yesterday and I found myself beckoned to the Pattern Matching section of Chapter 2 of the good book (Camel 2E). It was not a matter of needing a refresher of the regex, but simply a matter that sometimes you take solace in something very familiar to prepare yourself for a daunting task.

It was then that the words hit me, "The Perl Engine uses a nondeterministic finite state automaton to find a match. ... If you're cagey, you can write efficient patterns that don't do a lot of silly backtracking."

At that moment I had a "Eureka!" I remembered from my own studies that finite state automata have mathematical rules that allow you to work with them in a somewhat algebraic manner. And during those studies, we discussed that there exists software whose purpose is to reduce these automata to their optimal expressions. I think you may understand everything now.

What I am wondering is simple: Does a regular expression optimizer exist currently? There is no doubt in my mind that if it doesn't, that it is entirely possible to write one. But that is not of importance to me at this moment. If none exist, then I may embark on that journey in the future.

What is acceptable of such a device that spits out optimized regexes? Obviously, I would like to get the best regex possible no matter if I put in a horrible regex or a better version of the same horrible regex. (ie - r1 => ro and r2 => ro if r1 <s>~</s> r2) I don't think that time constraints are entirely important. Of course, I don't want to wait five years for an optimal regex, but if I give it a crappy regex, I expect that I'll have to wait longer than if I gave it a better version of the regex. In other words, it solves your problems based on the step you start it on.

I guess that covers the two basic requirements I would have for a regex optimizer.

ALL HAIL BRAK!!!

Replies are listed 'Best First'.
RE: Regular Expression Optimizer
by jlistf (Monk) on Aug 31, 2000 at 01:21 UTC
    As far as I understand, the ability to backreference in Perl's regular expressions makes this problem difficult at best. NDFA's (Non-Deterministic Finite Automata) can be represented as a set of points with certain lines connecting them. (Note: if you don't understand that, then I'm not going to explain it. Its not that I can't... its that I don't really have time right now. Its fairly complicated.) I read an article recently (can't remember where) describing Perl's regex engine as constructing NDFA's and working through them. The article described backreferencing as placing little recorders in certain places with record, stop and play buttons scattered about. I think this breaks the NDFA analogy and makes the rest of your argument null and void.

    Allright... I was about to start going into a flashback of my programming theory courses (I started thinking about Context-Free Grammars, Turing Machines, regular languages and NP-complete. it was ugly.) but I'll spare everyone that. In short... I think that if this was possible, and reasonably easy, someone would have done it by now. But if you think you can create something that'll do a reasonably good job, go for it and let me know how it goes.

    jeff

    Update: Thank you BlaisePascal for confirming my thoughts (see his point #2).
RE: Regular Expression Optimizer
by BlaisePascal (Monk) on Aug 31, 2000 at 01:23 UTC
    First off, I would guess that for any regular language, there may be several "optimal" REs that recognise it, so if R is recognised by r1 and r2, the "optimised" r1' and r2' are not necessarily identical.

    Second... Perl regexes are NP-Complete. Based on that, it is my -hunch- that optimization of Perl regexes is also hard. Note that Perl regexes are not regular.

    Third (or, actually zeroth)... What do you mean by "optimal"? Shortest? Fastest? Easiest to understand?

      Actually I think it is NP-Hard. Determining that you have a match is polynomial. Determining that you have the match that you are looking for (which is the difference between .* and .*?) is another story entirely... :-)
        You may be right... I know that it's been shown that NP-Complete problems can be reduced (in P time) to a Perl regex match. But I don't think the proofs I've seen looked at .* vs. .*?. The constructions used backreferences.
      To address the zeroth point, if it were possible to have such a machine then optimal would be fastest. The idea is that I can write an easy to understand regex that isn't fast, and likewise I can write a short regex that isn't fast, but I want a regex that is fast whether or not it sacrifices length or understandability.

      For the first comment, I can see both sides equally. If the two suboptimal regexes are equivalent, then it should find two optimal regexes for each, since I figure this is a machine where I put one input in and get one output out. But if these two inputs are equivalent, then there should be a single "most optimal" regex and if there is a difference in the results from putting r1 and r2, then either one isn't fully optimized since there is a single "most optimal", or my argument has a flaw since I assume that there exists a "most optimal" state.

      Admittedly, some time has passed since I was knee deep in language theory, so I may have missed some very critical points that cause a regex optimizer to fail. So, here's a question: If heuristics are employed in the development of this machine (and mind you at this point, the REO is going to remain purely theoretical), does it doom the optimization from the point at which a heuristic is employed? Does this machine require that it can only operate on the input regex with algorithmic manipulations?

      And finally, for the second point, that just rules out the ability to do the problem within a certain context. ALL HAIL BRAK!!!

        I'm beginning to think that "fastest" may be an illusionary goal.

        The execution time of a RE is dependant on both the RE and on the string it is working on (I think... I might have to think some more...). As such, the measure we are interested in is not T(r), but T(r,s).

        I'll define an "optimal RE r for language L" to be an RE r that recognises L and T(r,s) <= T(r',s) for all r' recognising L and all s in L.

        It's possible that there is a language G, such that an r, r' recognising G and s,s' in G exist with the property that T(r,s) <= T(r'',s) and T(r',s') <= T(r'',s') for all r'' recognising G, but T(r,s)<T(r',s) and T(r',s')<T(r,s'). In otherwords, r is the fastest for s but not for s', and vice versa. Which is the "optimal" one for G?

        So "fastest" might not always be possible.

        As for the other point... Assuming that there is always an RE for a language matching my definition of "optimal", then why can't there be two? Why can't we have r != r' but T(r,s) = T(r',s) for all s? Then which should the optimizer return? Or does it matter? I could very easily see R, R' that optimize into r, r' respecively, for the same language. Your argument isn't flawed due to the believe in a "most optimal" RE, just that you believe there is a unique "most optimal" RE for any language.

        But it may be that in traditional REs (not Perl REs, which have...complications) the idea of a "fastest" RE may be nonsensical. The standard machines to match regular languages are DFAs, which have one state transition per input character. Therefore, the running time of any DFA is a function solely of the length of the input string, and not of the RE used to build it. "Optimal" DFA for a language usually refers to numbers of states, not speed.

        What do you mean by your question about the REO machine being able to make only algorithmic manipulations to the input RE? What other sort of manipulations were you thinking of?

Re: Regular Expression Optimizer
by indigo (Scribe) on Aug 31, 2000 at 01:35 UTC
    The "Perl uses a NFSA" statement is not strictly true. So-called regular expressions in Perl accept languages ordinary FSA's cannot.

    /^(.+)\1$/ matches a string iff its second half is the same as its first. It is fairly easy to show (via the Pumping Lemma, or even a simple counting argument) no FSA can match this class of strings.

    So, even if an optimizer existed (and they do), it probably wouldn't be directly applicatible to Perl pattern matching.
RE (tilly) 1: Regular Expression Optimizer
by tilly (Archbishop) on Aug 31, 2000 at 02:18 UTC
    Perl's regular expressions already have a lot of optimizations in them. But if you are interested, have some time, and want to put in some serious thought, here is a link to an interesting discussion.

    Don't mean to toot my horn, my post is just a good entry point into a discussion involving a lot of interesting things.

    There were a few other good discussions in the same timeframe if you are willing to go look for them.