Re^2: Identifying Overlapping Matches in Nucleotide Sequence

kcott:

When you mentioned the overhead of capturing and case-insensitivity, it piqued my interest, so I put together a quickie, and it shows:

$ time perl ../ex_pattern_matching.pl
<GGCGATGGCGTTCTAGCGCGTAAAA>
x_cap_i: 7396  x_cap: 7396  x_i: 7396  x: 7396
poz: 9202  idx: 9202  ovr_cap_i: 9202  ovr_cap: 9202  ovr_i: 9202  ovr: 9202
            Rate ovr_cap   ovr  poz ovr_cap_i ovr_i x_cap x_cap_i  x_i  idx    x
ovr_cap   11.1/s      --  -11% -59%      -60%  -62%  -71%    -73% -73% -91% -92%
ovr       12.4/s     12%    -- -54%      -55%  -57%  -68%    -69% -70% -90% -91%
poz       26.9/s    143%  117%   --       -2%   -7%  -31%    -34% -35% -78% -79%
ovr_cap_i 27.5/s    148%  122%   2%        --   -5%  -29%    -32% -34% -78% -79%
ovr_i     28.8/s    160%  133%   7%        5%    --  -26%    -29% -31% -77% -78%
x_cap     38.8/s    250%  213%  44%       41%   35%    --     -4%  -7% -69% -70%
x_cap_i   40.5/s    266%  227%  51%       48%   41%    4%      --  -3% -67% -69%
x_i       41.6/s    275%  236%  55%       51%   44%    7%      3%   -- -66% -68%
idx        123/s   1012%  895% 358%      348%  327%  217%    204% 196%   --  -6%
x          131/s   1079%  955% 386%      375%  353%  237%    222% 214%   6%   --

real    0m37.721s
user    0m37.656s
sys     0m0.015s

The entries starting with "x" have the same problem that the OP had: the count is too low because they skip overlapping matches. The idx solution you provided was the clear winner (so long as you want correct results). I was surprised at the interaction between case insensitivity and capturing in the plain "x" matches: turning on either makes it slow, but the combination isn't any slower.
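
To make the undercount concrete, here is a minimal sketch on a string small enough to check by hand. The variable names are just for illustration and are not part of the benchmark script.

```perl
use strict;
use warnings;

# Tiny hand-checkable input: "AAAA" contains "AA" at offsets 0, 1 and 2.
my $str = 'AAAA';

# Plain /AA/g consumes both characters of each match, so the next
# search starts after them and the match at offset 1 is never seen.
my $plain = 0;
$plain++ while $str =~ /AA/g;

# A zero-width lookahead consumes nothing, so the engine only steps
# forward one character per match and every offset gets tested.
my $lookahead = 0;
$lookahead++ while $str =~ /(?=AA)/g;

print "plain: $plain  lookahead: $lookahead\n";   # plain: 2  lookahead: 3
```

The same effect, scaled up, is why the "x" counts above come out lower than the poz/idx/ovr counts on the random data.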

Since the simple (but incorrect) match was as quick as the index solution, I tried to "fix" the overlap by using the pos function on the simple match (the poz entry), but mucking about with the pos function put poz into the realm of the other overlapping match functions.
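
For anyone who hasn't assigned to pos before, this is roughly what the poz routine is doing, unrolled so each step is visible. It is only a sketch: the @starts array is my own addition for illustration, and stepping back by one assumes a two-character pattern like 'AA'.

```perl
use strict;
use warnings;

my $str = 'GAAAT';      # "AA" starts at offsets 1 and 2
my @starts;             # start offset of every match found

while ($str =~ /AA/g) {
    # After a successful match, pos() is the offset just past it.
    my $end = pos $str;
    push @starts, $end - 2;       # the pattern is 2 characters long

    # Step back one character so the next search can reuse the
    # second "A" of this match as the first "A" of the next one.
    pos($str) = $end - 1;
}

print "starts: @starts\n";        # starts: 1 2
```

Every match costs an extra pos read and pos assignment on top of the regex call, which is presumably where the time goes.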

The code:

#!/usr/bin/env perl
#
#   ex_pattern_matching.pl
#
#   Check speed of capturing, case sensitivity, etc.
#
use strict;
use warnings;
use Benchmark 'cmpthese';

sub gen_data {
    my ($len, @alphabet) = @_;
    my $rv = "";
    $rv .= $alphabet[@alphabet * rand] while $len > length $rv;
    return $rv;
}

sub x_cap_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(AA)/gi;
    return $cnt;
}

sub x_cap {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(AA)/g;
    return $cnt;
}

sub x_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /AA/gi;
    return $cnt;
}

sub x {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /AA/g;
    return $cnt;
}

sub ovr_cap_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(?=(AA))/gi;
    return $cnt;
}

sub ovr_cap {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(?=(AA))/g;
    return $cnt;
}

sub ovr_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(?=AA)/gi;
    return $cnt;
}

sub ovr {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(?=AA)/g;
    return $cnt;
}

sub idx {
    my ($str, $cnt) = @_;
    my $pos = 0;
    $cnt++ while $pos = 1 + index $str, 'AA', $pos;
    return $cnt;
}

sub poz {
    my ($str, $cnt) = @_;
    $cnt++, pos($str)=pos($str)-1 while $str =~ /AA/g;
    return $cnt;
}

print "<", gen_data(25, A=>C=>T=>'G'), ">\n";

my $long = gen_data(150_000, A=>C=>T=>'G');

print "x_cap_i: ", x_cap_i($long), "  ",
      "x_cap: ", x_cap($long),  "  ",
      "x_i: ", x_i($long), "  ",
      "x: ", x($long), "\n";
print "poz: ", poz($long), "  ",
      "idx: ", idx($long), "  ",
      "ovr_cap_i: ", ovr_cap_i($long), "  ",
      "ovr_cap: ", ovr_cap($long), "  ",
      "ovr_i: ", ovr_i($long), "  ",
      "ovr: ", ovr($long), "\n";

$long = gen_data(1_500_000, A=>C=>T=>'G');

cmpthese(100, {
        "x_cap_i"   => sub { return x_cap_i($long, 0) },
        "x_cap"     => sub { return x_cap($long, 0)  },
        "x_i"       => sub { return x_i($long, 0)  },
        "x"         => sub { return x($long, 0)  },
        "idx"       => sub { return idx($long, 0) },
        "ovr_cap_i" => sub { return ovr_cap_i($long, 0) },
        "ovr_cap"   => sub { return ovr_cap($long, 0) },
        "ovr_i"     => sub { return ovr_i($long, 0) },
        "ovr"       => sub { return ovr($long, 0)  },
        "poz"       => sub { return poz($long, 0)  },
    });

...roboticus

When your only tool is a hammer, all problems look like your thumb.


Replies are listed 'Best First'.

Re^3: Identifying Overlapping Matches in Nucleotide Sequence
by kcott (Archbishop) on Oct 27, 2017 at 21:12 UTC

G'day roboticus,

[You should be safe to drink coffee while reading this one. :-) ]

Thanks for writing that benchmark. I have previously posted benchmarks (as have others) comparing string functions with regexes: I believe the results are fairly well known. I left writing another one as an optional exercise for the OP.

It was, however, interesting that the various combinations of capturing and case-sensitivity didn't make that much difference (as you noted). I got the same sort of results as you (using Perl v5.26.0):

<ACACGAAGCGCTCGTGTGATTATCT>
x_cap_i: 7521  x_cap: 7521  x_i: 7521  x: 7521
poz: 9496  idx: 9496  ovr_cap_i: 9496  ovr_cap: 9496  ovr_i: 9496  ovr: 9496
            Rate ovr_cap   ovr ovr_cap_i  poz ovr_i x_cap x_cap_i  x_i  idx    x
ovr_cap   5.77/s      --  -12%      -69% -70%  -71%  -80%    -81% -83% -92% -92%
ovr       6.60/s     14%    --      -64% -66%  -67%  -77%    -79% -80% -91% -91%
ovr_cap_i 18.4/s    219%  179%        --  -4%   -8%  -36%    -41% -45% -75% -76%
poz       19.1/s    231%  190%        4%   --   -5%  -34%    -39% -43% -74% -75%
ovr_i     20.1/s    248%  204%        9%   5%    --  -30%    -36% -40% -73% -74%
x_cap     28.8/s    399%  337%       56%  51%   44%    --     -7% -14% -61% -62%
x_cap_i   31.2/s    440%  372%       69%  63%   55%    8%      --  -7% -58% -59%
x_i       33.3/s    477%  405%       81%  74%   66%   16%      7%   -- -55% -56%
idx       74.1/s   1183% 1023%      302% 287%  269%  157%    138% 122%   --  -3%
x         76.3/s   1222% 1057%      315% 299%  280%  165%    145% 129%   3%   --

By the way, I thought this was rather clever:

$cnt++ while $pos = 1 + index $str, 'AA', $pos;
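
For readers who find the one-liner a bit dense, a long-hand sketch of the same idea follows. The trick is that index returns -1 when nothing is found, so `1 + index(...)` is 0 (false) and ends the loop; otherwise it is exactly the offset at which the next search should start. The name count_overlapping is hypothetical and the sub assumes a non-empty search string.

```perl
use strict;
use warnings;

# Hypothetical helper; behaviour is meant to match the one-liner above.
# Assumes $needle is not the empty string.
sub count_overlapping {
    my ($str, $needle) = @_;
    my ($cnt, $from) = (0, 0);
    while (1) {
        # index returns the offset of the next occurrence at or after
        # $from, or -1 when there is none.
        my $found = index $str, $needle, $from;
        last if $found == -1;
        $cnt++;
        # Resume one character past the start of this hit, not past
        # its end, so overlapping occurrences are still found.
        $from = $found + 1;
    }
    return $cnt;
}

print count_overlapping('AAAA', 'AA'), "\n";      # 3
print count_overlapping('GATTACA', 'AA'), "\n";   # 0
```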

Out of interest, I added these two subroutines:

sub lc_x_cap_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(aa)/gi;
    return $cnt;
}

sub lc_x_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /aa/gi;
    return $cnt;
}

Then ran your benchmark code again with only the "lc_*" and "x*" routines. I ran it a few times; this seemed to be a fairly representative result:

<AGGTATGATGGTGTAGAGTAACTAG>
x_cap_i: 7506  lc_x_cap_i: 7506  x_cap: 7506  x_i: 7506  lc_x_i: 7506  x: 7506
             Rate lc_x_cap_i   x_cap_i     x_cap    lc_x_i        x_i          x
lc_x_cap_i 27.7/s         --       -1%      -12%      -13%       -14%       -66%
x_cap_i    28.0/s         1%        --      -11%      -12%       -13%       -66%
x_cap      31.6/s        14%       13%        --       -0%        -2%       -61%
lc_x_i     31.7/s        15%       13%        0%        --        -2%       -61%
x_i        32.4/s        17%       16%        2%        2%         --       -60%
x          81.3/s       193%      190%      157%      156%       151%         --

So clearly "x" (/AA/g) was faster than the rest (being ~80/s whereas the others were ~30/s, in all runs). I'd say the others were too close to call: although there may appear to be a trend, in two runs "x_cap" was the slowest.

— Ken
