in reply to Regex, capturing variables vs. speed

Following on from the previous replies, here is a benchmark demonstrating the performance difference. Note, though, that with this test string the non-greedy match is only about 2.8 times faster, not the 10 times described by the OP.

use warnings;
use strict;
use Benchmark qw(cmpthese);

my $target = 'This is a string used to test the time required for a greedy match compared to a non-greedy match.';
my $greedy = qr/(\ba\b.*\bstring\b)/;
my $non    = qr/(\ba\b.*?\bstring\b)/;

my ($matchG) = $target =~ $greedy;
my ($matchN) = $target =~ $non;

die "Matches generate different results\n" if $matchG ne $matchN;

cmpthese (
    -1,
    {
        'Greedy' => sub {$target =~ $greedy;},
        'Non'    => sub {$target =~ $non;}
    }
);

Prints:

           Rate Greedy    Non
Greedy 162689/s     --   -64%
Non    456847/s   181%     --
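The benchmark above guards against the two patterns returning different matches, which is worth doing because they only agree when the closing word occurs once. A minimal sketch of the difference, assuming a made-up test string (not from the thread) in which "string" appears twice:

```perl
use strict;
use warnings;

# Hypothetical test string: "string" occurs twice after the word "a".
my $text = 'a string and another string here';

# Greedy .* runs to the end of the text, then backtracks to the LAST "string".
my ($greedy) = $text =~ /(\ba\b.*\bstring\b)/;

# Non-greedy .*? grows one character at a time and stops at the FIRST "string".
my ($lazy) = $text =~ /(\ba\b.*?\bstring\b)/;

print "greedy: $greedy\n";   # greedy: a string and another string
print "lazy:   $lazy\n";     # lazy:   a string
```

The greedy form has to walk back from the end of the text, so the amount of backtracking grows with however much text follows the match.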

Perl is Huffman encoded by design.

Re^2: Regex, capturing variables vs. speed
by albert (Monk) on Oct 30, 2005 at 17:57 UTC
    Thanks to all for the feedback. I did get much more than a 2x speed increase from non-greedy over greedy matching, because my line to match is quite long. Taking what I've learned from the thread, I did some comparisons.
    use Benchmark qw/cmpthese/;

    my $line = 'rs11502186 C/G Chr11 170472 + ncbi_b34 perlegen urn:lsid:perlegen.hapmap.org:Protocol:Genotyping_1.0.0:2 urn:lsid:perlegen.hapmap.org:Assay:25763.7541533:1 urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1 QC+ GG GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG NN GG GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG NN GG GG GG GG NN GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG GG';

    cmpthese(-1, {
        'Greedy'     => sub {$line =~ /(chr.*?)\s.*urn:lsid:(.*?)\s.*panel:(.*?):/i},
        'Non'        => sub {$line =~ /(chr.*?)\s.*?urn:lsid:(.*?)\s.*?panel:(.*?):/i},
        'Sep'        => sub {$line =~ /(chr.*?)\s/i;
                             $line =~ /urn:lsid:(.*?)\s/i;
                             $line =~ /panel:(.*?):/i;
                        },
        'Death_star' => sub {$line =~ /(chr[^\s]+)/i;
                             $line =~ /urn:lsid:([^\s]+)/i;
                             $line =~ /panel:([^:]+)/i;
                        }
    } );
    Giving the following results:
                   Rate Greedy    Sep    Non Death_star
    Greedy       8650/s     --   -95%   -95%       -97%
    Sep        157827/s  1725%     --    -6%       -41%
    Non        167020/s  1831%     6%     --       -38%
    Death_star 267963/s  2998%    70%    60%         --
    Killing the star is clearly the way to go. Thanks to the Monks who helped me learn something.
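One thing to check when comparing these timings is whether the variants capture the same fields, since the sample line contains three urn:lsid: entries. A minimal sketch, assuming a shortened copy of the sample line (most genotype columns dropped) and illustrative variable names that are not from the thread:

```perl
use strict;
use warnings;

# Shortened copy of the sample line; the trailing genotype columns are trimmed.
my $line = 'rs11502186 C/G Chr11 170472 + ncbi_b34 perlegen '
         . 'urn:lsid:perlegen.hapmap.org:Protocol:Genotyping_1.0.0:2 '
         . 'urn:lsid:perlegen.hapmap.org:Assay:25763.7541533:1 '
         . 'urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1 QC+ GG GG GG NN GG';

# Same patterns as the 'Greedy' and 'Non' benchmark entries.
my @greedy = $line =~ /(chr.*?)\s.*urn:lsid:(.*?)\s.*panel:(.*?):/i;
my @non    = $line =~ /(chr.*?)\s.*?urn:lsid:(.*?)\s.*?panel:(.*?):/i;

# 'Death_star' style: three independent matches with negated character classes.
my ($chr)   = $line =~ /(chr[^\s]+)/i;
my ($lsid)  = $line =~ /urn:lsid:([^\s]+)/i;
my ($panel) = $line =~ /panel:([^:]+)/i;
my @star    = ($chr, $lsid, $panel);

print join(' | ', @greedy), "\n";
# Chr11 | perlegen.hapmap.org:Assay:25763.7541533:1 | CEPH-30-trios
print join(' | ', @non), "\n";
# Chr11 | perlegen.hapmap.org:Protocol:Genotyping_1.0.0:2 | CEPH-30-trios
print join(' | ', @star), "\n";
# Chr11 | perlegen.hapmap.org:Protocol:Genotyping_1.0.0:2 | CEPH-30-trios
```

On this input the greedy pattern settles on the second urn:lsid: entry, because the last one leaves no "panel:" to its right, while the other variants take the first. The greedy timing therefore also includes a failed pass over the whole genotype tail before backtracking.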

    -albert

      Even faster is to use a single match, but to be explicit about what you're looking for, i.e.
      'Better' => sub { $line =~ /(chr\S*).*?urn:lsid:(\S*).*?panel:([^:]*)/i }
      It's always better to write (\S*)\s than (.*?)\s, because you're making it clear to the matching engine exactly what you're looking for (non-space characters in this case).
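To illustrate the point above, here is a small sketch comparing the two forms on a fragment of the sample line. The fragment and variable names are chosen for illustration only. It also shows one behavioural difference to keep in mind: (.*?)\s needs trailing whitespace to match at all, while (\S*) on its own does not.

```perl
use strict;
use warnings;

# Fragment of the sample line, followed by whitespace.
my $field = 'urn:lsid:dcc.hapmap.org:Panel:CEPH-30-trios:1 QC+';

my ($lazy)     = $field =~ /urn:lsid:(.*?)\s/;   # expands one char at a time until \s fits
my ($explicit) = $field =~ /urn:lsid:(\S*)\s/;   # consumes the non-space run in one go

print "same capture\n" if $lazy eq $explicit;    # prints "same capture"

# With no whitespace after the value, only the \S* form (without \s) still matches.
my $tail = 'urn:lsid:abc';
my ($lazy_tail)     = $tail =~ /urn:lsid:(.*?)\s/;
my ($explicit_tail) = $tail =~ /urn:lsid:(\S*)/;

print defined $lazy_tail ? "lazy: $lazy_tail\n" : "lazy: no match\n";   # lazy: no match
print "explicit: $explicit_tail\n";                                      # explicit: abc
```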
        Thanks. The speed is better still, as you say.

                       Rate    Sep    Non Death_star Better
        Sep        158510/s     --   -10%       -23%   -65%
        Non        175363/s    11%     --       -15%   -61%
        Death_star 206769/s    30%    18%         --   -54%
        Better     449757/s   184%   156%       118%     --

        -albert

      Interesting, there must be differences between Perl builds too. The spread is not as great with ActiveState Perl v5.8.7. In particular, Better is not as much better.

                     Rate Greedy    Sep Death_star    Non Better
      Greedy       9965/s     --   -90%       -94%   -94%   -97%
      Sep        102700/s   931%     --       -34%   -37%   -66%
      Death_star 155342/s  1459%    51%         --    -5%   -48%
      Non        164099/s  1547%    60%         6%     --   -46%
      Better     301485/s  2925%   194%        94%    84%     --

      The results above used OP's benchmark code (with the addition of Better).

