in reply to Re^2: Identifying Overlapping Matches in Nucleotide Sequence
in thread Identifying Overlapping Matches in Nucleotide Sequence
G'day roboticus,
[You should be safe to drink coffee while reading this one. :-) ]
Thanks for writing that benchmark. I have previously posted benchmarks (as have others) comparing string functions with regexes: I believe the results are fairly well known. I left writing another one as an optional exercise for the OP.
It was, however, interesting that the various combinations of capturing and case-sensitivity didn't make that much difference (as you noted). I got the same sort of results as you (using Perl v5.26.0):
<ACACGAAGCGCTCGTGTGATTATCT>
x_cap_i: 7521
x_cap: 7521
x_i: 7521
x: 7521
poz: 9496
idx: 9496
ovr_cap_i: 9496
ovr_cap: 9496
ovr_i: 9496
ovr: 9496
            Rate ovr_cap   ovr ovr_cap_i  poz ovr_i x_cap x_cap_i  x_i  idx    x
ovr_cap   5.77/s      --  -12%      -69% -70%  -71%  -80%    -81% -83% -92% -92%
ovr       6.60/s     14%    --      -64% -66%  -67%  -77%    -79% -80% -91% -91%
ovr_cap_i 18.4/s    219%  179%        --  -4%   -8%  -36%    -41% -45% -75% -76%
poz       19.1/s    231%  190%        4%   --   -5%  -34%    -39% -43% -74% -75%
ovr_i     20.1/s    248%  204%        9%   5%    --  -30%    -36% -40% -73% -74%
x_cap     28.8/s    399%  337%       56%  51%   44%    --     -7% -14% -61% -62%
x_cap_i   31.2/s    440%  372%       69%  63%   55%    8%      --  -7% -58% -59%
x_i       33.3/s    477%  405%       81%  74%   66%   16%      7%   -- -55% -56%
idx       74.1/s   1183% 1023%      302% 287%  269%  157%    138% 122%   --  -3%
x         76.3/s   1222% 1057%      315% 299%  280%  165%    145% 129%   3%   --
By the way, I thought this was rather clever:
$cnt++ while $pos = 1 + index $str, 'AA', $pos;
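For anyone following along, here is a minimal sketch of why that one-liner works. The sub names and the 'AAAA' test string are mine, purely for illustration; they are not from the benchmark. index() returns -1 when there is no match, so adding 1 yields 0 (false) and the loop stops; on a hit, $pos lands one character past the start of the match, so overlapping matches are counted. A plain /AA/g resumes after the end of each match, which I assume is why the "x*" counts above are lower than the "idx" count.

```perl
use strict;
use warnings;

# Count occurrences of $needle in $str, overlaps included.
sub count_overlapping {
    my ($str, $needle) = @_;
    my ($cnt, $pos) = (0, 0);
    # 1 + index(...) is 0 (false) on failure; otherwise it is the
    # offset just past the START of the match, so overlaps are found.
    $cnt++ while $pos = 1 + index $str, $needle, $pos;
    return $cnt;
}

# Count non-overlapping occurrences of 'AA' with a global regex match.
sub count_non_overlapping {
    my ($str) = @_;
    my $cnt = 0;
    # /g resumes after the END of the previous match.
    $cnt++ while $str =~ /AA/g;
    return $cnt;
}

print count_overlapping('AAAA', 'AA'), "\n";   # prints 3
print count_non_overlapping('AAAA'), "\n";     # prints 2
```

I initialised $pos to 0 here only to keep "use warnings" quiet.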
Out of interest, I added these two subroutines:
sub lc_x_cap_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /(aa)/gi;
    return $cnt;
}

sub lc_x_i {
    my ($str, $cnt) = @_;
    ++$cnt while $str =~ /aa/gi;
    return $cnt;
}
Then ran your benchmark code again with only the "lc_*" and "x*" routines. I ran it a few times; this seemed to be a fairly representative result:
<AGGTATGATGGTGTAGAGTAACTAG>
x_cap_i: 7506
lc_x_cap_i: 7506
x_cap: 7506
x_i: 7506
lc_x_i: 7506
x: 7506
             Rate lc_x_cap_i x_cap_i x_cap lc_x_i  x_i    x
lc_x_cap_i 27.7/s         --     -1%  -12%   -13% -14% -66%
x_cap_i    28.0/s         1%      --  -11%   -12% -13% -66%
x_cap      31.6/s        14%     13%    --    -0%  -2% -61%
lc_x_i     31.7/s        15%     13%    0%     --  -2% -61%
x_i        32.4/s        17%     16%    2%     2%   -- -60%
x          81.3/s       193%    190%  157%   156% 151%   --
So clearly "x" (/AA/g) was faster than the rest (being ~80/s whereas the others were ~30/s, in all runs). I'd say the others were too close to call: although there may appear to be a trend, in two runs "x_cap" was the slowest.
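If anyone wants to try a cut-down version of this comparison, something along these lines should do. This is only a sketch, not roboticus' benchmark: the sub names, the random 100,000-base string and the one-CPU-second run time are all my own assumptions, and your rates will differ.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test data: a random 100,000-base nucleotide string.
my @bases = qw(A C G T);
my $str   = join '', map { $bases[rand @bases] } 1 .. 100_000;

# Case-sensitive, non-capturing (the "x" style).
sub count_plain {
    my ($s) = @_;
    my $cnt = 0;
    ++$cnt while $s =~ /AA/g;
    return $cnt;
}

# Case-insensitive, non-capturing (the "x_i" style).
sub count_nocase {
    my ($s) = @_;
    my $cnt = 0;
    ++$cnt while $s =~ /AA/gi;
    return $cnt;
}

# Sanity check: on upper-case-only input both must agree.
die "counts differ" unless count_plain($str) == count_nocase($str);

# Run each sub for at least 1 CPU second and print a comparison table.
cmpthese(-1, {
    plain  => sub { count_plain($str) },
    nocase => sub { count_nocase($str) },
});
```

A negative first argument tells cmpthese to run each sub for at least that many CPU seconds rather than for a fixed number of iterations.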
— Ken