albert has asked for the wisdom of the Perl Monks concerning the following question:

I have some written some code to capture certain parts of a line into three variables. I've written this in two ways. First, I capture the three relevant variables in a single regex.
my ($chr,$prot,$panel) = /(chr.*?)\s.*urn:lsid:(.*?)\s.*panel:(.*?):/i +;
Alternatively, I capture each with separate regexes.
my ($chr) = /(chr.*?)\s/i; my ($prot) = /urn:lsid:(.*?)\s/i; my ($panel) = /panel:(.*?):/i;
When testing on my source files, I'm seeing the 2nd approach (separate regexes) is much, much faster. Maybe 10x or so, though I lost the benchmarking I did. I don't understand why the 2nd approach should be so much faster.

Can some monk help give me insight into this, and help me understand how to get maximum speed out of regex matching.

-albert

Replies are listed 'Best First'.
Re: Regex, capturing variables vs. speed
by Zaxo (Archbishop) on Oct 29, 2005 at 21:28 UTC

    The three use simpler regexen, so they run faster. A particular drain on your combined regex is the ".*" (dot-star) you use to join the interesting ones. Each will cause the regex to match to the end of the line, then backtrack to find the following fixed text. ".*?" would be a faster joining regex.

    See Ovid's Death to Dot Star!

    After Compline,
    Zaxo

      A particular drain on your combined regex is the ".*" (dot-star) you use to join the interesting ones.

      I wouldn't automatically point fingers at the .* even though, as you say, .*? would probably be better for him. The issue is how his use of .* is combined with his use of .*? when what he really means is \S*. He'd probably see a significant speedup even if switching to \S* was his only change, but incorporating your suggestion makes sense as well.

      So, I'd rewrite

      my ($chr,$prot,$panel) = /(chr.*?)\s.*urn:lsid:(.*?)\s.*panel:(.*?):/i;
      as
      my ($chr, $prot, $panel) = /(chr\S*)\s.*?urn:lsid(\S*)\s.*?panel:([^:] +*):/i;
      (Note that changing the last .*? to [^:]* is a good idea too even if the efficiency gain isn't much.

      Update: Well, reading further down the thread, it seems robin beat me to it.

      -sauoq
      "My two cents aren't worth a dime.";
      
Re: Regex, capturing variables vs. speed
by Ovid (Cardinal) on Oct 29, 2005 at 21:33 UTC

    Without seeing the text you're trying to match, my guess is that the "dot star" is causing backtracking issues. See Death to Dot Star! for more information.

    Update: Heh :) Zaxo beat me to it.

    Cheers,
    Ovid

    New address of my CGI Course.

Re: Regex, capturing variables vs. speed
by GrandFather (Saint) on Oct 30, 2005 at 10:33 UTC

    Following on from the previous replies here is a benchmark demonstrating the performance difference. Note though that with the test string the speed difference is only of the order of two times, not the 10 times described by OP.

    use warnings; use strict; use Benchmark qw(cmpthese); my $target = 'This is a string used to test the time required for a gr +eedy match compared to a non-greedy match.'; my $greedy = qr/(\ba\b.*\bstring\b)/; my $non = qr/(\ba\b.*?\bstring\b)/; my ($matchG) = $target =~ $greedy; my ($matchN) = $target =~ $non; die "Matches generate different results\n" if $matchG ne $matchN; cmpthese ( -1, { 'Greedy' => sub {$target =~ $greedy;}, 'Non' => sub {$target =~ $non;} } ); Prints: Rate Greedy Non Greedy 162689/s -- -64% Non 456847/s 181% --

    Perl is Huffman encoded by design.
      Thanks to all for feedback. I did get much more than 2x speed increase for greedy vs. not because my line to match is quite long. Taking what I've learned from the thread, I did some comparisons.
      use Benchmark qw/cmpthese/; my $line = 'rs11502186 C/G Chr11 170472 + ncbi_b34 perlegen urn:lsid:p +erlegen.hapmap.org:Protocol:Genotyping_1.0.0:2 urn:lsid:perlegen.hapm +ap.org:Assay:25763.7541533:1 urn:lsid:dcc.hapmap.org:Panel:CEPH-30-tr +ios:1 QC+ GG GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG + GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG NN + GG GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG NN GG GG GG GG NN + GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG + GG GG GG GG GG GG'; cmpthese(-1, { 'Greedy' => sub {$line =~ /(chr.*?)\s.*urn:lsid:(.*?)\s.*p +anel:(.*?):/i}, 'Non' => sub {$line =~ /(chr.*?)\s.*?urn:lsid:(.*?)\s.*?pa +nel:(.*?):/i}, 'Sep' => sub {$line =~ /(chr.*?)\s/i; $line =~ /urn:lsid:(.*?)\s/i; $line =~ /panel:(.*?):/i; }, 'Death_star' => sub {$line =~ /(chr[^\s]+)/i; $line =~ /urn:lsid:([^\s]+)/i; $line =~ /panel:([^:]+)/i; } } );
      Giving the following results:
      Rate Greedy Sep Non Death_star Greedy 8650/s -- -95% -95% -97% Sep 157827/s 1725% -- -6% -41% Non 167020/s 1831% 6% -- -38% Death_star 267963/s 2998% 70% 60% --
      Killing the star is clearly the way to go. Thanks to the Monks which helped me learn something.

      -albert

        Even faster is to use a single match, but to be explicit about what you're looking for, i.e.
        'Better' => sub { $line =~ /(chr\S*).*?urn:lsid:(\S*).*?panel:([^:]*)/i }
        It's always better to write (\S*)\s than (.*?)\s, because you're making it clear to the matching engine exactly what you're looking for (non-space characters in this case).

        Interesting, there must be Perl differences too. The spread is not as great with Active State Perl v5.8.7. In particular, Better is not as much better.

        Rate Greedy Sep Death_star Non Be +tter Greedy 9965/s -- -90% -94% -94% +-97% Sep 102700/s 931% -- -34% -37% +-66% Death_star 155342/s 1459% 51% -- -5% +-48% Non 164099/s 1547% 60% 6% -- +-46% Better 301485/s 2925% 194% 94% 84% + --

        The results above used OP's benchmark code (with the addition of Better).


        Perl is Huffman encoded by design.
Re: Regex, capturing variables vs. speed
by GrandFather (Saint) on Oct 30, 2005 at 10:55 UTC

    There must be something unusual about OP's data. A simple test gives very similar results for the two techniques:

    use warnings; use strict; use Benchmark qw(cmpthese); my $target = 'Some leading text' . 'chrMatchTextForChr ' . 'Some text to be ignored following chr ' . 'urn:lsid:UrnLisdMatchText ' . 'Text to be ignored after urn:lsid match text' . 'panel:Text: Some text that doesn\'t get matched by anything'; die "Different results" if all () ne parts (); cmpthese (-1, {'all' => \&all, 'parts' => \&parts}); sub all { my ($chr,$prot,$panel) = $target =~ /(chr.*?)\s.*urn:lsid:(.*?)\s.*pan +el:(.*?):/i; return "$chr$prot$panel"; } sub parts { my ($chr) = $target =~ /(chr.*?)\s/i; my ($prot) = $target =~ /urn:lsid:(.*?)\s/i; my ($panel) = $target =~ /panel:(.*?):/i; return "$chr$prot$panel"; } Prints: Rate parts all parts 82116/s -- -17% all 98443/s 20% --

    Perl is Huffman encoded by design.

      Maybe the version of perl, and the regex engine, being used by the OP differs?

Re: Regex, capturing variables vs. speed
by Anonymous Monk on Oct 30, 2005 at 14:07 UTC
    Are you sure you're not secretly parsing XML using regex?

      You mean that regex "know" when one is parsing XML and deliberatly and maliciously slow down as a result? That's pretty mean! Perl should at least generate a warning in that situation.


      Perl is Huffman encoded by design.
Re: Regex, capturing variables vs. speed
by Hexren (Initiate) on Oct 31, 2005 at 01:07 UTC
    Although I am far too lazy to check with my oreilly right now I would hazard a guess that ".*" is the crucial point. Matching something like "chr.*?\s" should be fairly easy. Discard anything that does not look like "chr" if you find something that looks like it. See if the next thing is a \s else get whatever it is and look if the thing after that thing is a \s. In comparison looking for ".*" is more expensive as that .* tries to grap as much as possible and it has to play through all the possible combinations to see if it cann match something more. It could be summarized like "Matching greedy things is more expensive than matching non greedy things" Is that understandable (I do hope it is right somewhat ;) ? Regards Hexren
Re: Regex, capturing variables vs. speed
by Aristotle (Chancellor) on Nov 01, 2005 at 21:02 UTC

    Here’s a quick tip: if you do a simple

    perl -Mre=debug -le'"some test data here" =~ /your regex here/'

    on the command line, you will see a) how the regex engine compiles your pattern, including which optimisations it discovered to be applicable while compiling, and b) which steps it performs during actual matching, including which optimisations actually made a difference. (F.ex., seeing “Match rejected by optimizer” for input you want to reject is ideal, because it means the match failure was detected even before the engine was powered up.)

    This is an invaluable aid in understanding the performance characteristics of different patterns, as you can see just what the engine is really doing.

    Makeshifts last the longest.

      That's a great tool, and here's something I discovered by playing around. There's a difference between executing:
      perl -Mre=debug -le'"lahblahblahblah" =~ /(.?lah)$1{2}/'
      and
      perl -Mre=debug -le'"lahblahblahblah" =~ /(.?lah)\1{2}/'

      The first one matches on "lahblahblah", and the second one matches on "blahblahblah". Why do you suppose that is?

      The debugging engine appears to see the $1 as an almost literal copy of the regexp in the brackets. Whereas the \1 seems to be looking for whatever was matched by the regexp in the brackets..

      Update: Thanks to Aristotle for pointing out the error in the $1 interpretation.

        The regex engine is not seeing any $1. A $1 in a regex does not mean anything special, it’s just a variable that gets interpolated as a literal string. In your case, $1 is empty when that pattern is compiled, so /(.?lah)$1{2}/ becomes simply /(.?lah){2}/. You can see this if you read the compiler output carefully – the CURLYX[0] {2,2} applies to the parenthesised expression.

        Makeshifts last the longest.

Re: Regex, capturing variables vs. speed
by demerphq (Chancellor) on Nov 01, 2005 at 14:13 UTC

    I can suggest a rationale for why the seperate regexes might be faster. The seperate regexes only need to use Boyer-Moore matching to find the start position for the constant strings in the patter, and then capture for the appropriate amount following that point. With the unified regex it will only find one of the constant strings (iirc, it will be 'chr') and then needs to match right from that point.

    Other changes to pattern will play a role in the various runtimes so the above is not the only analysis required, but it does give some insight into how the regex engine will handle your pattern.

    ---
    $world=~s/war/peace/g