shadowSage has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks. Got another one to lay on you: In the 'Intermediate Perl' book, it is discussing the Regexp::Assemble module.

After the brief discussion of how it optimizes the expression it says 'If you are using v5.10 or later, Perl already does this for you.'

I did a quick (not exactly comprehensive) script to check what that statement was on about

use 5.020; use Regexp::Assemble; my $re = Regexp::Assemble->new; for('blah', 'bleh', 'chicken', 'woofie', 'snake', 'snakes', 'woofies', + 'chicken-woofy'){ $re->add("\Q$_"); } say $re->re; my $newre = qr/blah|bleh|chicken|woofie|snake|snakes|woofies|chicken-w +oofy/; say $newre;

The result:

(?^:(?:chicken(?:-woofy)?|woofies?|bl[ae]h|snakes?)) (?^u:(blah)|(bleh)|(chicken)|(woofie)|(snake)|(snakes)|(woofies)|(chic +ken-woofy))

What's the deal? Does it just happen internally when /matching|substituting/? Or do I have the wrong idea completely?

Thanks for your time in reading this

Replies are listed 'Best First'.
Re: Optimization of Regexes
by ikegami (Patriarch) on Sep 22, 2014 at 12:44 UTC
    $ perlbrew use 5.8.9 $ perl -Mre=debug -e'qr/a|b|c|d/' Freeing REx: `","' Compiling REx `a|b|c|d' size 13 Got 108 bytes for offset annotations. 1: BRANCH(4) 2: EXACT <a>(13) 4: BRANCH(7) 5: EXACT <b>(13) 7: BRANCH(10) 8: EXACT <c>(13) 10: BRANCH(13) 11: EXACT <d>(13) 13: END(0) minlen 1 Offsets: [13] 0[0] 1[1] 0[0] 2[1] 3[1] 0[0] 4[1] 5[1] 0[0] 6[1] 7[1] 0[0] 8[ +0] Freeing REx: `"a|b|c|d"' $ perlbrew use 5.10.1t $ perl -Mre=debug -e'qr/a|b|c|d/' Compiling REx "a|b|c|d" Final program: 1: TRIEC-EXACT[a-d] (13) <a> <b> <c> <d> 13: END (0) stclass AHOCORASICKC-EXACT[a-d] minlen 1 Freeing REx: "a|b|c|d"

      I get what your saying. The output of the debug is a bit cryptic to my eyes (I am still new) but clearly a difference between the versions.

      Don't suppose the optimized regex gets stored anywhere accessible? (I don't see anything in perlvar, but you never know)

        but clearly a difference between the versions

        Specifically, a difference in the program into which the regex is compiled. As you mentioned, the difference is that the latter is more efficient to execute. It's slower to compile, though.

        Don't suppose the optimized regex gets stored anywhere accessible?

        What? What is it you want?

      Even though the output of the disassembly is cryptic, two things are quite clear:   First, that every regular-expression string is “compiled” into the thing that is actually carried-out by the engine.   Second, that as particular use-cases are identified that can be improved upon, more-efficient pseudoinstructions are devised and the regex-compiler is modified to generate them.

      In the big picture, that ought to be considered, “free sweets.”   Your already-well thought-out algorithm simply runs a few microseconds faster than before.   Generally, disassembly of a complicated regex is most-useful as a way of cofirming that it actually does instruct the computer to do what you meant to say.