in reply to Regex, capturing variables vs. speed

Here’s a quick tip: if you do a simple

perl -Mre=debug -le'"some test data here" =~ /your regex here/'

on the command line, you will see a) how the regex engine compiles your pattern, including which optimisations it found applicable while compiling, and b) which steps it performs during actual matching, including which optimisations actually made a difference. (For example, seeing “Match rejected by optimizer” for input you want to reject is ideal, because it means the match failure was detected before the engine proper was even started.)

This is an invaluable aid in understanding the performance characteristics of different patterns, as you can see just what the engine is really doing.
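
If the pattern lives in a script rather than a one-liner, the same pragma can be switched on for just the code you care about. A minimal sketch, assuming perl 5.10 or later (where use re 'debug' is lexically scoped); the input string and the pattern are placeholders:

```perl
use strict;
use warnings;

my $input = "some test data here";    # placeholder input, as in the one-liner

my $matched;
{
    # Regex debugging is enabled for this block only. The compile-time and
    # run-time traces are written to STDERR.
    use re 'debug';
    $matched = $input =~ /test\s+data/ ? 1 : 0;    # placeholder pattern
}

# Patterns compiled outside the block produce no debug output.
print "matched: $matched\n";
```

Redirecting STDERR to a file (2>trace.txt) keeps the trace separate from the script's normal output.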

Makeshifts last the longest.

Re^2: Regex, capturing variables vs. speed
by monarch (Priest) on Nov 02, 2005 at 03:49 UTC
    That's a great tool, and here's something I discovered by playing around. There's a difference between executing:
    perl -Mre=debug -le'"lahblahblahblah" =~ /(.?lah)$1{2}/'
    and
    perl -Mre=debug -le'"lahblahblahblah" =~ /(.?lah)\1{2}/'

    The first one matches on "lahblah", and the second one matches on "blahblahblah". Why do you suppose that is?

    The debugging engine appears to see the $1 as an almost literal copy of the regexp in the brackets, whereas the \1 seems to look for whatever was matched by the regexp in the brackets.

    Update: Thanks to Aristotle for pointing out the error in the $1 interpretation.

      The regex engine is not seeing any $1. A $1 in a regex does not mean anything special; it’s just a variable whose current value gets interpolated into the pattern before compilation. In your case, $1 is empty when that pattern is compiled, so /(.?lah)$1{2}/ becomes simply /(.?lah){2}/. You can see this if you read the compiler output carefully: the CURLYX[0] {2,2} applies to the parenthesised expression.
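
      This is easy to check in a script. A minimal sketch (the helper matched_text is just an illustrative name, not anything built in): compile the pattern while $1 is still undefined, then compare its match with those of /(.?lah){2}/ and of the \1 version.

      ```perl
      use strict;
      use warnings;

      my $text = "lahblahblahblah";

      # Return the matched substring ($&) for a pattern, or undef on failure.
      sub matched_text {
          my ($string, $regex) = @_;
          return $string =~ $regex ? $& : undef;
      }

      my $interpolated;
      {
          # No match has succeeded yet, so $1 is undefined and interpolates
          # as an empty string; silence the resulting warning.
          no warnings 'uninitialized';
          $interpolated = qr/(.?lah)$1{2}/;
      }

      # \1 is a true backreference, resolved by the engine while matching.
      my $backref = qr/(.?lah)\1{2}/;

      print "interpolated \$1:  ", matched_text($text, $interpolated), "\n";
      print "backreference \\1: ", matched_text($text, $backref), "\n";
      print "plain {2}:        ", matched_text($text, qr/(.?lah){2}/), "\n";
      ```

      The first and third lines should report the same match, while the \1 pattern matches "blahblahblah".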

      Makeshifts last the longest.