in reply to RE: RE: Regular Expression Optimizer
in thread Regular Expression Optimizer

I'm beginning to think that "fastest" may be an illusionary goal.

The execution time of a RE is dependant on both the RE and on the string it is working on (I think... I might have to think some more...). As such, the measure we are interested in is not T(r), but T(r,s).

I'll define an "optimal RE r for language L" to be an RE r that recognises L and T(r,s) <= T(r',s) for all r' recognising L and all s in L.

It's possible that there is a language G, such that an r, r' recognising G and s,s' in G exist with the property that T(r,s) <= T(r'',s) and T(r',s') <= T(r'',s') for all r'' recognising G, but T(r,s)<T(r',s) and T(r',s')<T(r,s'). In otherwords, r is the fastest for s but not for s', and vice versa. Which is the "optimal" one for G?

So "fastest" might not always be possible.

As for the other point... Assuming that there is always an RE for a language matching my definition of "optimal", then why can't there be two? Why can't we have r != r' but T(r,s) = T(r',s) for all s? Then which should the optimizer return? Or does it matter? I could very easily see R, R' that optimize into r, r' respecively, for the same language. Your argument isn't flawed due to the believe in a "most optimal" RE, just that you believe there is a unique "most optimal" RE for any language.

But it may be that in traditional REs (not Perl REs, which have...complications) the idea of a "fastest" RE may be nonsensical. The standard machines to match regular languages are DFAs, which have one state transition per input character. Therefore, the running time of any DFA is a function solely of the length of the input string, and not of the RE used to build it. "Optimal" DFA for a language usually refers to numbers of states, not speed.

What do you mean by your question about the REO machine being able to make only algorithmic manipulations to the input RE? What other sort of manipulations were you thinking of?

Replies are listed 'Best First'.
RE (tilly) 4: Regular Expression Optimizer
by tilly (Archbishop) on Aug 31, 2000 at 06:12 UTC
    Actually it isn't so simple.

    First of all are we including compilation time in the performance? A DFA can be implemented with a complex analysis up front, but faster runtime, or by doing a simple analysis up front, and then interpret a bitmap of states at the end.

    The first can result in a compiled RE that is exponential in the size of the expression. The second cannot blow up at compile time that way, but the size of the RE definitely affects the runtime performance.

    In the real world, either can wind up being faster.

    Going beyond that, any smart RE engine (it would be safe to assume that Perl's is smart) can take advantage of the fact that long fixed substrings can be located and matched *faster* than you can scan a string. An RE engine that is able to notice then use such optimizations is very nice to have - and this is one of the reasons that simple Perl scripts can wind up beating custom C programs at text processing!

    (I would wonder if this is an example of a "non-algorithmic" improvement. :-)

RE:^4 Regular Expression Optimizer
by PsychoSpunk (Hermit) on Aug 31, 2000 at 08:30 UTC
    I was getting a bit disillusioned with the possibility of finding the fastest as well. However, as it's stated in Camel 2E, once the expression matches, the Perl Engine quits. So while in general, the execution of the RE is dependent on the RE and the string which it matches (For the sake of this argument, although not true in all cases in real life, I'm saying that anything that doesn't match takes up n amount of time. This is simply done since any failed string will take up the most time for any RE whether optimized or not), there will eventually be a regex such that, for every valid string in the language checked by the regex, it will perform optimally. It will not perform every execution in t seconds for every case, since it depends on the string; but of two equivalent REs (that share the exact same language), this optimized one will always execute faster than the suboptimal one. Fastest may be a pipe dream in all cases, but it could be possible to return the faster of all possibilities. At which point, we're now getting into the limitations of the implementation rather than anything else.

    I will concede that there may be a case where two equally optimal REs are available. This is entirely plausible, and I sometimes fall prey to a binary train of thought. Thus, while I've been reading replies to all of this, I've thought more as I go along, and since I did want to spur dialogue, I've let my thoughts wander down the ways everyone else is thinking to see how approachable this could be. Which of course leads me to your final two questions:

    Obviously, this whole discussion of language theory has mathematical rules under which it operates. I think everyone here is aware of what a regular expression optimizer runs up against simply by the fact that no one was able to say "Oh, you want to go look at x." So, my question was simply if heuristic manipulations would be allowed to the input RE. Basically, what if during implementation, a few "leap of faith" type heuristics were involved, but they also performed remarkably well. Does that affect the reception of my REO machine into the community since it doesn't always act as expected mathematically? ALL HAIL BRAK!!!