Re: Multiple Regex's on a Big Sequence

index is usually faster than a regexp at searching for a constant string, though I haven't timed a /g match against its index equivalent. Here's the (untested) code:

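# $sequence and $chr are assumed to be defined already.
local $, = "\t";   # print() separates its arguments with a tab
local $\ = "\n";   # print() appends a newline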

foreach my $short (qw(
   CACGTG
   GTGCAC
)) {
   my $pos = 0;
   my $len = length($short);
   while (($pos = index($sequence, $short, $pos)) >= 0) {
      print $chr, '+', substr($sequence, $pos, $len);
      $pos += $len;
   }
}

As an added bonus, replace $pos += $len; with $pos++; to allow overlapping matches.
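To make the difference concrete, here is a minimal sketch of the overlapping variant. The helper name overlapping_positions and the sample string are made up for illustration; they are not from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns every start
# offset of $motif in $seq, advancing one character after each hit so
# that overlapping occurrences are kept.
sub overlapping_positions {
    my ($seq, $motif) = @_;
    return unless length $motif;    # an empty motif would loop forever
    my @hits;
    my $pos = 0;
    while (($pos = index($seq, $motif, $pos)) >= 0) {
        push @hits, $pos;
        $pos++;                     # step by 1, not by length($motif)
    }
    return @hits;
}

# Made-up sample data: 'ACA' occurs at offsets 0, 2 and 4.
my @hits = overlapping_positions('ACACACA', 'ACA');
print "@hits\n";
```

With $pos += $len instead, the same input would report only offsets 0 and 4, because the search resumes after the end of each match.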

If you do use regexps, @- and @+ can be used instead of pos. Specifically, substr($sequence, $-[0], $+[0] - $-[0]) will return the matched text. See perlvar.
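A rough sketch of the @- / @+ approach, assuming a made-up sample sequence and the two motifs from the code above combined into one alternation:

```perl
use strict;
use warnings;

my $sequence = 'TTCACGTGAAGTGCACTT';    # made-up sample data

my @found;
while ($sequence =~ /CACGTG|GTGCAC/g) {
    # $-[0] is the start offset of the whole match, $+[0] the offset
    # just past its end.
    my ($start, $end) = ($-[0], $+[0]);
    push @found, [ $start, substr($sequence, $start, $end - $start) ];
}

# One line per match: start offset, then the matched text.
print join("\t", @$_), "\n" for @found;
```

Note that a plain /g loop like this resumes after the end of each match, so it behaves like the $pos += $len version and skips overlapping occurrences.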

Re^2: Multiple Regex's on a Big Sequence by bernanke01 (Beadle) on Aug 16, 2006 at 15:35 UTC
Ahh, I never even thought of index. And it has the nice advantage of being parallelizable across CPUs just like a regex. I'll benchmark the three approaches (index, regex, and combined regexes) and report back.