in reply to Regexp Speed Concerns

Want to see how a regular expression is compiled? Perl can tell you!

perldebug:

Debugging regular expressions There are two ways to enable debugging output for regular expressions. If your perl is compiled with -DDEBUGGING, you may use the -Dr flag on the command line. Otherwise, one can use re 'debug', which has effects both at compile time, and at run time (and is not lexically scoped). [snip; read the details in the perldebug documentation.]
The really nice part is that, unlike most of the low-level debugging features, you don't need a DEBUGGING build of Perl to debug your regular expressions.
#!perl -c use re 'debug'; /at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/; /at|bt|ct|dt|et|ft|ght/; /[abcdef]t|ght/; /(?:[abcdef]|gh)t/; /(?:a|b|c|d|e|f|gt)t/; __END__
Here's the output from 5.005_03, reformatted for a little extra clarity:
/at/||/bt/||/ct/||/dt/||/et/||/ft/||/ght/; compiling RE `at' size 3 first at 1 1: EXACT <at>(3) 3: END(0) anchored `at' at 0 (checking anchored isall) minlen 2 compiling RE `bt' size 3 first at 1 1: EXACT <bt>(3) 3: END(0) anchored `bt' at 0 (checking anchored isall) minlen 2 compiling RE `ct' size 3 first at 1 1: EXACT <ct>(3) 3: END(0) anchored `ct' at 0 (checking anchored isall) minlen 2 compiling RE `dt' size 3 first at 1 1: EXACT <dt>(3) 3: END(0) anchored `dt' at 0 (checking anchored isall) minlen 2 compiling RE `et' size 3 first at 1 1: EXACT <et>(3) 3: END(0) anchored `et' at 0 (checking anchored isall) minlen 2 compiling RE `ft' size 3 first at 1 1: EXACT <ft>(3) 3: END(0) anchored `ft' at 0 (checking anchored isall) minlen 2 compiling RE `ght' size 4 first at 1 1: EXACT <ght>(4) 4: END(0) anchored `ght' at 0 (checking anchored isall) minlen 3 /at|bt|ct|dt|et|ft|ght/; compiling RE `at|bt|ct|dt|et|ft|ght' size 23 1: BRANCH(4) 2: EXACT <at>(23) 4: BRANCH(7) 5: EXACT <bt>(23) 7: BRANCH(10) 8: EXACT <ct>(23) 10: BRANCH(13) 11: EXACT <dt>(23) 13: BRANCH(16) 14: EXACT <et>(23) 16: BRANCH(19) 17: EXACT <ft>(23) 19: BRANCH(23) 20: EXACT <ght>(23) 23: END(0) minlen 2 /[abcdef]t|ght/; compiling RE `[abcdef]t|ght' size 18 1: BRANCH(14) 2: ANYOF(12) 12: EXACT <t>(18) 14: BRANCH(18) 15: EXACT <ght>(18) 18: END(0) minlen 2 /(?:[abcdef]|gh)t/; compiling RE `(?:[abcdef]|gh)t' size 18 1: BRANCH(12) 2: ANYOF(16) 12: BRANCH(15) 13: EXACT <gh>(16) 15: TAIL(16) 16: EXACT <t>(18) 18: END(0) minlen 2 /(?:a|b|c|d|e|f|gt)t/; compiling RE `(?:a|b|c|d|e|f|gt)t' size 25 1: BRANCH(4) 2: EXACT <a>(23) 4: BRANCH(7) 5: EXACT <b>(23) 7: BRANCH(10) 8: EXACT <c>(23) 10: BRANCH(13) 11: EXACT <d>(23) 13: BRANCH(16) 14: EXACT <e>(23) 16: BRANCH(19) 17: EXACT <f>(23) 19: BRANCH(22) 20: EXACT <gt>(23) 22: TAIL(23) 23: EXACT <t>(25) 25: END(0) minlen 2
Of course, that's just how the regular expressions are compiled; to see how the regexes are matched -- which is where the optimizer really comes into play -- give the regexes something to match against and watch the output at runtime!

Replies are listed 'Best First'.
Re^2: Regexp Speed Concerns
by tadman (Prior) on Jan 13, 2001 at 03:31 UTC
    That DEBUGGING option is pretty powerful, thanks for the tip. That is proof positive of what I have been suspecting, that Perl will "compile" the regexps, but it doesn't perform any optimizations in any intelligent sense. I was expecting a behaviour like "lex", which takes an "expression" and returns a fairly optimized state-machine which parses the defined tokens.

    I am considering making an optimizer which will tweak a given regexp into something more efficient, such that you could give it something and it would polish it into something known to Benchmark faster. Of course, discovering all the tricks would be a bit of a chore, which is part of the challenge.

    My idea is that this hypothetical function would take a regexp as input and return the optimized, but functionally equivalent, version of same:
    Input Output --------------------------------------- 'x|xy' 'xy?' 'xx|xyx' 'xy?x' 'a|abc' 'a(bc)?' 'a|ab|abc' 'a(b(c)?)?' 'a1|a2|a3' 'a[123]'
    The intention is to add optimization techniques as they are discovered, plus an implementation strategy is devised. Certainly, not every input regexp will be understood to the degree that it can be optimized, and these will be left undisturbed.

    Already, I am finding that the "optimial" solution is somewhat obscure in even simple cases:
    /that|this|thinks|thirst|three/ -> /th(at|i(?:(?:nk)?s|rst)|ree/ ? -> /th(?:at|is|inks|irst|ree)/ ? -> /th(?:at|i(?:nk)?s|irst|ree)/ ?

    Just food for thought.
      Did you miss Re: Regexp Speed Concerns? I pointed there to two links here where there is code for automatic optimization of REs. The third link is to a general purpose optimization which would need to be done within Perl's engine, but which could lead to very good things.

      As for why Perl doesn't do what lex does, the finite state engine that lex produces is a DFA, and so unable to handle backreferences or tell the difference between .* and .*?.