kcott:

When you mentioned the overhead of capturing and case-insensitivity, it piqued my interest, so I put together a quickie, and it shows:

$ time perl ../ex_pattern_matching.pl
<GGCGATGGCGTTCTAGCGCGTAAAA>
x_cap_i: 7396  x_cap: 7396  x_i: 7396  x: 7396
poz: 9202  idx: 9202  ovr_cap_i: 9202  ovr_cap: 9202  ovr_i: 9202  ovr: 9202
            Rate ovr_cap   ovr  poz ovr_cap_i ovr_i x_cap x_cap_i  x_i  idx    x
ovr_cap   11.1/s      --  -11% -59%      -60%  -62%  -71%    -73% -73% -91% -92%
ovr       12.4/s     12%    -- -54%      -55%  -57%  -68%    -69% -70% -90% -91%
poz       26.9/s    143%  117%   --       -2%   -7%  -31%    -34% -35% -78% -79%
ovr_cap_i 27.5/s    148%  122%   2%        --   -5%  -29%    -32% -34% -78% -79%
ovr_i     28.8/s    160%  133%   7%        5%    --  -26%    -29% -31% -77% -78%
x_cap     38.8/s    250%  213%  44%       41%   35%    --     -4%  -7% -69% -70%
x_cap_i   40.5/s    266%  227%  51%       48%   41%    4%      --  -3% -67% -69%
x_i       41.6/s    275%  236%  55%       51%   44%    7%      3%   -- -66% -68%
idx        123/s   1012%  895% 358%      348%  327%  217%    204% 196%   --  -6%
x          131/s   1079%  955% 386%      375%  353%  237%    222% 214%   6%   --

real    0m37.721s
user    0m37.656s
sys     0m0.015s

The entries starting with "x" have the same problem that the OP had: the count is too low (7396 instead of 9202) because they skip overlapping matches. The idx solution you provided was the clear winner (so long as you want correct results). I was surprised at the interaction between case insensitivity and capturing on the plain match: turning on either one drops it from 131/s to about 40/s, but the combination isn't any slower.
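To see the undercount on something small enough to check by hand, here is a minimal sketch. The sub names (count_plain, count_lookahead, count_index) are made up for this illustration and are not part of the benchmark script; the patterns are the same ones used in x, ovr and idx.

```perl
use strict;
use warnings;

# Plain /g match: after each hit the search resumes past the END of the
# match, so overlapping occurrences are skipped.
sub count_plain {
    my ($str) = @_;
    my $cnt = 0;
    ++$cnt while $str =~ /AA/g;
    return $cnt;
}

# Lookahead match: the match itself is zero-width, so the search only
# advances one character per hit and overlaps are counted.
sub count_lookahead {
    my ($str) = @_;
    my $cnt = 0;
    ++$cnt while $str =~ /(?=AA)/g;
    return $cnt;
}

# index-based count: restart one character past the START of each hit.
# index returns -1 on failure, so 1 + index is 0 (false) and the loop ends.
sub count_index {
    my ($str) = @_;
    my ($cnt, $pos) = (0, 0);
    ++$cnt while $pos = 1 + index $str, 'AA', $pos;
    return $cnt;
}

printf "plain: %d  lookahead: %d  index: %d\n",
    count_plain('AAAA'), count_lookahead('AAAA'), count_index('AAAA');
```

On 'AAAA' there are three occurrences of AA (at offsets 0, 1 and 2), and this prints "plain: 2  lookahead: 3  index: 3".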

Since the simple (but incorrect) match was about as quick as the index solution, I tried to "fix" the overlap by backing up pos after each simple match (the poz entry), but mucking about with pos dropped poz to 26.9/s, into the realm of the other overlapping match functions.
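For anyone who hasn't played with pos before, this is roughly what the rewind does, shown on a short string. It is a sketch only: overlapping_starts is a hypothetical helper, and it uses $-[0] (the start offset of the last match) rather than pos($str)-1, which comes to the same thing for a two-character pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return the start offset of every 'AA',
# including overlapping ones, by rewinding pos after each match.
sub overlapping_starts {
    my ($str) = @_;
    my @starts;
    while ($str =~ /AA/g) {
        push @starts, $-[0];      # start offset of this match
        pos($str) = $-[0] + 1;    # resume one character past the start
    }
    return @starts;
}

print join(',', overlapping_starts('GAAAT')), "\n";
```

This prints "1,2": the first match starts at offset 1, and because pos is moved back to 2 instead of staying at 3, the second AA (offsets 2 and 3) is found as well.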

The code:

#!/usr/bin/env perl
#
#   ex_pattern_matching.pl
#
#   Check speed of capturing, case sensitivity, etc.
#
use strict;
use warnings;
use Benchmark 'cmpthese';

sub gen_data {
    my ($len, @alphabet) = @_;
    my $rv = "";
    $rv .= $alphabet[@alphabet * rand] while $len > length $rv;
    return $rv;
}

sub x_cap_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(AA)/gi;
    return $cnt;
}

sub x_cap {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(AA)/g;
    return $cnt;
}

sub x_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /AA/gi;
    return $cnt;
}

sub x {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /AA/g;
    return $cnt;
}

sub ovr_cap_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(?=(AA))/gi;
    return $cnt;
}

sub ovr_cap {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(?=(AA))/g;
    return $cnt;
}

sub ovr_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(?=AA)/gi;
    return $cnt;
}

sub ovr {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(?=AA)/g;
    return $cnt;
}

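sub idx {
    my ($str, $cnt) = @_;
    my $pos = 0;
    # index returns -1 when nothing is found, so 1 + index is false and ends
    # the loop; otherwise the next search starts one character past the hit.
    $cnt++ while $pos = 1 + index $str, 'AA', $pos;
    return $cnt;
}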

sub poz {
    my ($str, $cnt) = @_;
    # Back pos up by one after each match so the next search can overlap.
    $cnt++, pos($str)=pos($str)-1 while $str =~ /AA/g;
    return $cnt;
}

print "<", gen_data(25, A=>C=>T=>'G'), ">\n";

my $long = gen_data(150_000, A=>C=>T=>'G');

print "x_cap_i: ", x_cap_i($long), "  ",
      "x_cap: ", x_cap($long),  "  ",
      "x_i: ", x_i($long), "  ",
      "x: ", x($long), "\n";
print "poz: ", poz($long), "  ",
      "idx: ", idx($long), "  ",
      "ovr_cap_i: ", ovr_cap_i($long), "  ",
      "ovr_cap: ", ovr_cap($long), "  ",
      "ovr_i: ", ovr_i($long), "  ",
      "ovr: ", ovr($long), "\n";

$long = gen_data(1_500_000, A=>C=>T=>'G');

cmpthese(100, {
        "x_cap_i"   => sub { return x_cap_i($long, 0) },
        "x_cap"     => sub { return x_cap($long, 0)  },
        "x_i"       => sub { return x_i($long, 0)  },
        "x"         => sub { return x($long, 0)  },
        "idx"       => sub { return idx($long, 0) },
        "ovr_cap_i" => sub { return ovr_cap_i($long, 0) },
        "ovr_cap"   => sub { return ovr_cap($long, 0) },
        "ovr_i"     => sub { return ovr_i($long, 0) },
        "ovr"       => sub { return ovr($long, 0)  },
        "poz"       => sub { return poz($long, 0)  },
    });

...roboticus

When your only tool is a hammer, all problems look like your thumb.

In reply to Re^2: Identifying Overlapping Matches in Nucleotide Sequence by roboticus
in thread Identifying Overlapping Matches in Nucleotide Sequence by FIJI42