Re^3: Identifying Overlapping Matches in Nucleotide Sequence

[You should be safe to drink coffee while reading this one. :-) ]

Thanks for writing that benchmark. I have previously posted benchmarks (as have others) comparing string functions with regexes: I believe the results are fairly well known. I left writing another one as an optional exercise for the OP.

It was, however, interesting that the various combinations of capturing and case-sensitivity didn't make that much difference (as you noted). I got the same sort of results as you (using Perl v5.26.0):

<ACACGAAGCGCTCGTGTGATTATCT>
x_cap_i: 7521  x_cap: 7521  x_i: 7521  x: 7521
poz: 9496  idx: 9496  ovr_cap_i: 9496  ovr_cap: 9496  ovr_i: 9496  ovr: 9496
            Rate ovr_cap   ovr ovr_cap_i  poz ovr_i x_cap x_cap_i  x_i  idx    x
ovr_cap   5.77/s      --  -12%      -69% -70%  -71%  -80%    -81% -83% -92% -92%
ovr       6.60/s     14%    --      -64% -66%  -67%  -77%    -79% -80% -91% -91%
ovr_cap_i 18.4/s    219%  179%        --  -4%   -8%  -36%    -41% -45% -75% -76%
poz       19.1/s    231%  190%        4%   --   -5%  -34%    -39% -43% -74% -75%
ovr_i     20.1/s    248%  204%        9%   5%    --  -30%    -36% -40% -73% -74%
x_cap     28.8/s    399%  337%       56%  51%   44%    --     -7% -14% -61% -62%
x_cap_i   31.2/s    440%  372%       69%  63%   55%    8%      --  -7% -58% -59%
x_i       33.3/s    477%  405%       81%  74%   66%   16%      7%   -- -55% -56%
idx       74.1/s   1183% 1023%      302% 287%  269%  157%    138% 122%   --  -3%
x         76.3/s   1222% 1057%      315% 299%  280%  165%    145% 129%   3%   --

By the way, I thought this was rather clever:

$cnt++ while $pos = 1 + index $str, 'AA', $pos;
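The reason that one-liner counts overlapping matches: index returns -1 when nothing is found, so 1 + index(...) evaluates to 0 (false) and the loop ends; on success, $pos is left one character past the start of the match, so the next search can begin inside the previous match. Here is a minimal sketch of that, set beside a plain /AA/g count. The sub names and the sample string are purely illustrative and are not taken from the benchmark code:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the idiom; the name is illustrative only.
sub count_overlapping {
    my ($str, $needle) = @_;
    my ($cnt, $pos) = (0, 0);
    # index() returns -1 on failure, so 1 + index(...) is 0 (false) and the
    # loop stops. On success, $pos ends up one past the START of the match,
    # which lets the next search begin inside the previous match.
    $cnt++ while $pos = 1 + index $str, $needle, $pos;
    return $cnt;
}

# For comparison: /g in scalar context resumes after the END of each match,
# so matches that overlap a previous one are skipped.
sub count_non_overlapping {
    my ($str) = @_;
    my $cnt = 0;
    $cnt++ while $str =~ /AA/g;
    return $cnt;
}

my $sample = 'GAAAATAAC';
printf "overlapping: %d, non-overlapping: %d\n",
    count_overlapping($sample, 'AA'), count_non_overlapping($sample);
```

For 'GAAAATAAC' the index version finds "AA" starting at offsets 1, 2, 3 and 6 (four matches), while /AA/g finds three.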

Out of interest, I added these two subroutines:

sub lc_x_cap_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(aa)/gi;
    return $cnt;
}

sub lc_x_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /aa/gi;
    return $cnt;
}

I then ran your benchmark code again with only the "lc_*" and "x*" routines. I ran it a few times; this seemed to be a fairly representative result:

<AGGTATGATGGTGTAGAGTAACTAG>
x_cap_i: 7506  lc_x_cap_i: 7506  x_cap: 7506  x_i: 7506  lc_x_i: 7506  x: 7506
             Rate lc_x_cap_i   x_cap_i     x_cap    lc_x_i        x_i          x
lc_x_cap_i 27.7/s         --       -1%      -12%      -13%       -14%       -66%
x_cap_i    28.0/s         1%        --      -11%      -12%       -13%       -66%
x_cap      31.6/s        14%       13%        --       -0%        -2%       -61%
lc_x_i     31.7/s        15%       13%        0%        --        -2%       -61%
x_i        32.4/s        17%       16%        2%        2%         --       -60%
x          81.3/s       193%      190%      157%      156%       151%         --

So clearly "x" (/AA/g) was faster than the rest (being ~80/s whereas the others were ~30/s, in all runs). I'd say the others were too close to call: although there may appear to be a trend, in two runs "x_cap" was the slowest.
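For anyone wanting to repeat this kind of comparison, a rough harness using the core Benchmark module might look like the sketch below. Several things here are assumptions rather than the original benchmark: only /AA/g for "x" is stated in this post, so the bodies of x_i, x_cap and x_cap_i are guessed from their names; the sequence length, the random data and the 0.5 CPU-second run time are arbitrary; and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Assumed test data: a random upper-case nucleotide string. The length is
# arbitrary and not taken from the original benchmark.
my @bases = qw(A C G T);
my $seq   = join '', map { $bases[ rand @bases ] } 1 .. 20_000;

# "x" is /AA/g as stated in the post; the other "x*" bodies are guesses based
# on their names. The "lc_*" bodies follow the two subroutines posted above.
my %subs = (
    x          => sub { my ($str, $cnt) = @_; ++$cnt while $str =~ /AA/g;    $cnt },
    x_i        => sub { my ($str, $cnt) = @_; ++$cnt while $str =~ /AA/gi;   $cnt },
    x_cap      => sub { my ($str, $cnt) = @_; ++$cnt while $str =~ /(AA)/g;  $cnt },
    x_cap_i    => sub { my ($str, $cnt) = @_; ++$cnt while $str =~ /(AA)/gi; $cnt },
    lc_x_i     => sub { my ($str, $cnt) = @_; ++$cnt while $str =~ /aa/gi;   $cnt },
    lc_x_cap_i => sub { my ($str, $cnt) = @_; ++$cnt while $str =~ /(aa)/gi; $cnt },
);

# Sanity check: every variant should report the same count on this data.
my %count;
$count{$_} = $subs{$_}->($seq) for sort keys %subs;
print join('  ', map { "$_: $count{$_}" } sort keys %count), "\n";

# cmpthese calls each code ref with no arguments, so wrap each counter in a
# closure that passes the sequence in.
my %timed = map { my $name = $_; ($name => sub { $subs{$name}->($seq) }) } keys %subs;

# A negative first argument means "run each entry for at least that many
# CPU seconds" rather than a fixed number of iterations.
cmpthese(-0.5, \%timed);
```

The absolute rates will differ from the figures above, since the input here is shorter and the hardware and Perl version will vary; only the relative ordering is of interest.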

— Ken
