in reply to Regular Expression Optimizer
First off, I would guess that for any regular language, there may be several "optimal" REs that recognise it, so if R is recognised by r1 and r2, the "optimised" r1' and r2' are not necessarily identical.
Second... matching with Perl regexes is NP-Complete. Based on that, it is my -hunch- that optimizing Perl regexes is also hard. Note that Perl regexes are not regular.
Third (or, actually zeroth)... What do you mean by "optimal"? Shortest? Fastest? Easiest to understand?
RE (tilly) 2: Regular Expression Optimizer
by tilly (Archbishop) on Aug 31, 2000 at 02:09 UTC
Actually I think it is NP-Hard. Determining that you have
a match is polynomial. Determining that you have the
match that you are looking for (which is the difference
between .* and .*?) is another story entirely... :-)
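The .* versus .*? distinction tilly mentions can be made concrete. A minimal sketch in Python (whose re engine follows Perl's backtracking semantics), using a made-up sample string:

```python
import re

text = "<a><b>"

# Greedy: .* grabs as much as possible while still allowing a match.
greedy = re.search(r"<(.*)>", text).group(1)

# Lazy: .*? grabs as little as possible.
lazy = re.search(r"<(.*?)>", text).group(1)

print(greedy)  # a><b
print(lazy)    # a
```

Both patterns succeed on the same strings; they differ only in *which* match they report, which is exactly the part that goes beyond "is there a match".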
You may be right... I know that it's been shown that NP-Complete problems can be reduced (in P time) to a Perl regex match. But I don't think the proofs I've seen looked at .* vs. .*?. The constructions used backreferences.
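The flavour of those backreference constructions can be seen in a well-known example (Abigail's composite-number regex): with a backreference, a single pattern recognises composite numbers written in unary, which no true regular expression can do. A Python sketch, since Python's re supports the same construct:

```python
import re

def is_composite(n):
    # (11+?)\1+ succeeds iff the unary string "1"*n can be split
    # into two or more equal blocks of length >= 2, i.e. iff n
    # has a nontrivial factorisation.
    return re.fullmatch(r"(11+?)\1+", "1" * n) is not None

print([n for n in range(2, 16) if is_composite(n)])  # [4, 6, 8, 9, 10, 12, 14, 15]
```

The backtracking search over possible block sizes is what gives backreference patterns their power, and their cost.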
RE: RE: Regular Expression Optimizer
by PsychoSpunk (Hermit) on Aug 31, 2000 at 02:59 UTC
To address the zeroth point, if it were possible to have
such a machine then optimal would be fastest. The idea is
that I can write an easy to understand regex that isn't
fast, and likewise I can write a short regex that isn't
fast, but I want a regex that is fast even if it
sacrifices length or understandability.
For the first comment, I can see both sides equally. If
the two suboptimal regexes are equivalent, the machine
should still find an optimal regex for each, since I figure
this is a machine where I put one input in and get one
output out. But if these two inputs are equivalent, then
there should be a single "most optimal" regex. So if the
results from feeding in r1 and r2 differ, then either one
of them isn't fully optimized (since there is a single
"most optimal"), or my argument has a flaw in assuming that
a "most optimal" state exists.
Admittedly, some time has passed since I was knee deep in
language theory, so I may have missed some very critical
points that cause a regex optimizer to fail. So, here's a
question: If heuristics are employed in the development of
this machine (and mind you at this point, the REO is going
to remain purely theoretical), does it doom the optimization
from the point at which a heuristic is employed? Does this
machine require that it can only operate on the input regex
with algorithmic manipulations?
And finally, for the second point, that just rules out
solving the problem efficiently within a certain context.
ALL HAIL BRAK!!!
I'm beginning to think that "fastest" may be an illusory
goal.
The execution time of an RE is dependent on both the RE and
on the string it is run against (I think... I might have
to think some more...). As such, the measure we are
interested in is not T(r), but T(r,s).
I'll define an "optimal RE r for language L" to be an RE r
that recognises L such that T(r,s) <= T(r',s) for all r'
recognising L and all s in L.
It's possible that there is a language G, with r, r'
recognising G and s, s' in G, such that
T(r,s) <= T(r'',s)
and T(r',s') <= T(r'',s')
for all r'' recognising G, but T(r,s) < T(r',s)
and T(r',s') < T(r,s'). In other words, r is the
fastest for s but not for s', and vice versa. Which is the
"optimal" one for G?
So "fastest" might not always be well defined.
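To make T(r,s) tangible, here is a hedged Python sketch (pattern and string length are arbitrary choices): two REs recognising the same language whose running times diverge wildly on a non-matching string, because one of them backtracks catastrophically:

```python
import re
import time

slow = re.compile(r"(a+)+b")  # nested quantifiers: exponential backtracking
fast = re.compile(r"a+b")     # same language, fails quickly

s = "a" * 20  # not in the language: no trailing 'b'

for name, pat in (("slow", slow), ("fast", fast)):
    t0 = time.perf_counter()
    assert pat.fullmatch(s) is None        # both correctly reject s...
    print(name, time.perf_counter() - t0)  # ...but in very different times
```

So T really is a function of both arguments: on short matching strings the two patterns are indistinguishable, while on long failing strings only the nested one blows up.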
As for the other point... Assuming that there is always
an RE for a language matching my definition of "optimal",
then why can't there be two? Why can't we have r != r'
but T(r,s) = T(r',s) for all s? Then which should
the optimizer return? Or does it matter? I could very easily
see R, R' that optimize into r, r' respectively, for the same
language. Your argument isn't flawed by the belief in
a "most optimal" RE, only by the belief that there is a
unique "most optimal" RE for any language.
But it may be that in traditional REs (not Perl REs, which have...complications)
the idea of a "fastest" RE may be nonsensical. The standard
machines to match regular languages are DFAs, which have one
state transition per input character. Therefore, the running
time of any DFA is a function solely of the length of the
input string, and not of the RE used to build it. "Optimal"
DFA for a language usually refers to the number of states, not
speed.
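The one-transition-per-character point can be sketched with a toy table-driven DFA. The transition table below is hand-built for the regular expression (ab)*, purely for illustration:

```python
# Toy table-driven DFA for the regular expression (ab)*.
# Matching does exactly one table lookup per input character,
# so the running time depends only on len(s), not on which RE
# the table was compiled from.

# States: 0 = expecting 'a' (accepting), 1 = expecting 'b', 2 = dead.
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 2, (1, "b"): 0,
    (2, "a"): 2, (2, "b"): 2,
}
ACCEPTING = {0}

def dfa_match(s):
    state = 0
    for ch in s:  # one transition per character, no backtracking
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPTING

print(dfa_match("abab"))  # True
print(dfa_match("aba"))   # False
```

Whatever RE the table came from, the loop body is the same, which is why "optimal" for DFAs is measured in states rather than speed.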
What do you mean by your question about the REO machine being
able to make only algorithmic manipulations to the input RE?
What other sort of manipulations were you thinking of?
Actually it isn't so simple.
First of all, are we including compilation time in the
performance? A DFA can be implemented with a complex
analysis up front for a faster runtime, or with a simple
analysis up front followed by interpreting a bitmap of
states at match time.
The first can result in a compiled RE that is exponential
in the size of the expression. The second cannot blow up
at compile time that way, but the size of the RE definitely
affects the runtime performance.
In the real world, either can wind up being faster.
Going beyond that, any smart RE engine (it would be safe
to assume that Perl's is smart) can take advantage of the
fact that long fixed substrings can be located and matched
*faster* than you can scan a string. An RE engine that
is able to notice and use such optimizations is very nice
to have - and this is one of the reasons that simple Perl
scripts can wind up beating custom C programs at text
processing!
(I would wonder if this is an example of a "non-algorithmic"
improvement. :-)
I was getting a bit disillusioned with the possibility of finding the fastest as well. However, as stated in Camel 2E, once the expression matches, the Perl engine quits. So while in general the execution of an RE depends on both the RE and the string it matches (for the sake of this argument, though not true in all real-life cases, I'm saying that any string that doesn't match takes a fixed amount of time n, since a failed string will take up the most time for any RE, optimized or not), there will eventually be a regex such that, for every valid string in the language it checks, it performs optimally.

It will not perform every execution in t seconds for every case, since that depends on the string; but of two equivalent REs (REs that share exactly the same language), the optimized one will always execute faster than the suboptimal one. Fastest may be a pipe dream in all cases, but it could be possible to return the faster of all possibilities. At that point, we're getting into the limitations of the implementation rather than anything else.
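The "once it matches, the engine quits" behaviour is easy to demonstrate; a small Python sketch (Python's re makes the same leftmost, first-alternative choice Perl does):

```python
import re

# A backtracking engine returns the FIRST way the pattern can
# succeed, not the longest: alternative order decides the match.
print(re.match(r"foo|foobar", "foobar").group())  # foo
print(re.match(r"foobar|foo", "foobar").group())  # foobar
```

So two REs for the same language can report different matches as well as take different times, which matters for what an optimizer is allowed to change.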
I will concede that there may be a case where two equally optimal REs are available. This is entirely plausible, and I sometimes fall prey to a binary train of thought. Thus, while I've been reading replies to all of this, I've thought more as I go along, and since I did want to spur dialogue, I've let my thoughts wander down the ways everyone else is thinking to see how approachable this could be. Which of course leads me to your final two questions:
Obviously, this whole discussion of language theory has mathematical rules under which it operates. I think everyone here is aware of what a regular expression optimizer runs up against, simply from the fact that no one was able to say "Oh, you want to go look at x." So my question was simply whether heuristic manipulations of the input RE would be allowed. Basically, what if during implementation a few "leap of faith" heuristics were involved, but they also performed remarkably well? Does that affect the reception of my REO machine in the community, since it wouldn't always act as expected mathematically?
ALL HAIL BRAK!!!