in reply to Re: Regexp Speed Concerns
in thread Regexp Speed Concerns

That DEBUGGING option is pretty powerful, thanks for the tip. That is proof positive of what I have been suspecting, that Perl will "compile" the regexps, but it doesn't perform any optimizations in any intelligent sense. I was expecting a behaviour like "lex", which takes an "expression" and returns a fairly optimized state-machine which parses the defined tokens.

I am considering making an optimizer which will tweak a given regexp into something more efficient, such that you could give it something and it would polish it into something known to Benchmark faster. Of course, discovering all the tricks would be a bit of a chore, which is part of the challenge.

My idea is that this hypothetical function would take a regexp as input and return the optimized, but functionally equivalent, version of same:
Input Output --------------------------------------- 'x|xy' 'xy?' 'xx|xyx' 'xy?x' 'a|abc' 'a(bc)?' 'a|ab|abc' 'a(b(c)?)?' 'a1|a2|a3' 'a[123]'
The intention is to add optimization techniques as they are discovered, plus an implementation strategy is devised. Certainly, not every input regexp will be understood to the degree that it can be optimized, and these will be left undisturbed.

Already, I am finding that the "optimial" solution is somewhat obscure in even simple cases:
/that|this|thinks|thirst|three/ -> /th(at|i(?:(?:nk)?s|rst)|ree/ ? -> /th(?:at|is|inks|irst|ree)/ ? -> /th(?:at|i(?:nk)?s|irst|ree)/ ?

Just food for thought.

Replies are listed 'Best First'.
Re (tilly) 3: Regexp Speed Concerns
by tilly (Archbishop) on Jan 13, 2001 at 04:24 UTC
    Did you miss Re: Regexp Speed Concerns? I pointed there to two links here where there is code for automatic optimization of REs. The third link is to a general purpose optimization which would need to be done within Perl's engine, but which could lead to very good things.

    As for why Perl doesn't do what lex does, the finite state engine that lex produces is a DFA, and so unable to handle backreferences or tell the difference between .* and .*?.