Multiple Regex's on a Big Sequence

bernanke01 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Multiple Regex's on a Big Sequence by ikegami (Patriarch) on Aug 16, 2006 at 06:47 UTC
`index` is usually faster at searching for a constant string than a regexp, but I haven't timed `/g` against its `index` equivalent. Here's the (untested) code: `local $, = "\t"; local $\ = "\n"; foreach my $short (qw( CACGTG GTGCAC )) { my $pos = 0; my $len = length($short); while (($pos = index($sequence, $short, $pos)) >= 0) { print $chr, '+', substr($sequence, $pos, $len); $pos += $len; } }` [download] As an added bonus, replace `$pos += $len;` with `$pos++;` to allow overlapping matches. If you do use regexps, `@-` and `@+` can be used instead of `pos`. Specifically, `substr($sequence, $-[0], $+[0] - $-[0])` will return the matched text. See perlvar.	[reply] [d/l] [select]
Re^2: Multiple Regex's on a Big Sequence by bernanke01 (Beadle) on Aug 16, 2006 at 15:35 UTC
Ahh, I never even thought of index. And it has the nice advantage of being parallelizable across CPUs just like a regex. I'll benchmark the three approaches (index, regex, and combined regex's) and report back.	[reply]
Re: Multiple Regex's on a Big Sequence by borisz (Canon) on Aug 16, 2006 at 07:00 UTC
Combiming the patterns looks easy, And if you need to run diffence code for differnt patterns put them in a hash with the pattern as the key and the coderef as the value. `use Regexp::Assemble; my @search_patterns = ( 'CACGTG', 'GTGCAC' ); my $ra = Regexp::Assemble->new; $ra->add( @search_patterns ); while ( $seq =~ /$ra/g ) { ... }` [download] Boris	[reply] [d/l]
Re^2: Multiple Regex's on a Big Sequence by bernanke01 (Beadle) on Aug 16, 2006 at 15:32 UTC
Ahh, never heard of this module, that's cool! The one thing I should have mentioned is that I'm on a multi-processor system so a reg-ex or index approach might be more parallelizable. I think I should benchmark all the options though.	[reply]
Re: Multiple Regex's on a Big Sequence by lima1 (Curate) on Aug 16, 2006 at 12:35 UTC
do you want to search for (nearly) exact matches in a chromosom? Then you really should use a suffix array implementation. It is O(m + log(n)) where m is query size and n is sequence size. So probably MUCH faster than your solution. /msg me if you are interested and I send you my mature vmatch module.	[reply]
Re: Multiple Regex's on a Big Sequence - Benchmark by bernanke01 (Beadle) on Aug 16, 2006 at 22:55 UTC
I've done some benchmarking of the three options (not including lima1's suggestion of suffix-trees): sequential regular expressions, sequential index's, and merging the regular expressions together using Regexp::Assemble. I'm not great at writing benchmarks, so perhaps this script is simplistic, and I'd appreciate comments on how to improve it. Nevertheless, here are the results: `Regexp : 230 wallclock secs (229.86 usr + 0.06 sys = 229.92 CPU) Index : 22 wallclock secs (21.47 usr + 0.01 sys = 21.48 CPU) Merged : 663 wallclock secs (663.73 usr + 0.17 sys = 663.90 CPU)` [download] So it appears that even without forking the indexing approach is much faster for exact matches. Here's the benchmarking code I used. use strict; use Bio::SeqIO; use Benchmark; use Regexp::Assemble; use Carp; my $seqio = Bio::SeqIO->new( -file => "chrY.fa.masked", -format => "fasta" ); my $seq = $seqio->next_seq(); my $sequence = $seq->seq(); my @strings = ( 'CACGTG', 'GTGCAC' ); my $t0 = new Benchmark; for (1..100) { foreach my $string (@strings) { my $len = length($string); while ($sequence =~ /$string/g) { my $val = pos($seque +nce) - $len; } } } my $t1 = new Benchmark; my $t2 = new Benchmark; for (1..100) { foreach my $string (@strings) { my $pos = 0; my $len = length($string); while ( ($pos = index($sequence, $string, $pos)) >= 0 +) { my $val = substr($sequence, $pos, $len); $pos += $len; } } } my $t3 = new Benchmark; my $t4 = new Benchmark; for (1..100) { my $ra = Regexp::Assemble->new(); $ra->add(@strings); while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } } my $t5 = new Benchmark; my $td1 = timediff($t1, $t0); my $td2 = timediff($t3, $t2); my $td3 = timediff($t5, $t4); print 'Regexp : ', timestr($td1), "\n"; print 'Index : ', timestr($td2), "\n"; print 'Merged : ', timestr($td3), "\n"; [download]	[reply] [d/l] [select]
Re^2: Multiple Regex's on a Big Sequence - Benchmark by hv (Prior) on Aug 17, 2006 at 11:21 UTC
For the cases where you compare multiple regexps against your target string, it may save time if you also `study($sequence)` before starting the matches. This will do a scan of the sequence to allow subsequent matches to use the Boyer-Moore algorithm - it builds a linked list of the locations of each different character in the sequence, and then takes advantage of the frequency data to pick the rarest character for which to walk the list. Because the main benefit of this approach is about rarity, it may not be a big win for a case like this where the string uses only a 4-character alphabet, and (presumably) uses each character roughly 1/4 of the time; I'd be interested to see how it affects the benchmarks. Hugo	[reply] [d/l]
Re^3: Multiple Regex's on a Big Sequence - Benchmark by bernanke01 (Beadle) on Aug 18, 2006 at 02:02 UTC
Great idea, I'll add it to the next round of Benchmarks.	[reply]
Re^2: Multiple Regex's on a Big Sequence - Benchmark by borisz (Canon) on Aug 17, 2006 at 10:56 UTC
That looks wrong to me, since therre is no need to constuct the `$ra` 100 times. Try it this way. Also try it with more patterns or do you search only for two? `... my $t4 = new Benchmark; my $ra = Regexp::Assemble->new(); $ra->add(@strings); for (1..100) { while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } }` [download] Boris	[reply] [d/l]
Re^3: Multiple Regex's on a Big Sequence - Benchmark by bernanke01 (Beadle) on Aug 18, 2006 at 02:01 UTC
Hey, I'll try that. However, I figured that if it were a timing test I'd have to compare 100 repeats of everything, and that included construction of the RA. Otherwise it's an apples-to-oranges comparison where one method had the chance to do up-front work and the others didn't. I'll post benchmarks on that in a few minutes though and we can see if it makes a difference.	[reply]
Re^2: Multiple Regex's on a Big Sequence - Benchmark by BrowserUk (Patriarch) on Aug 17, 2006 at 00:29 UTC
Posting the results would be good. Otherwise people would have to go through the nightmare that is installing Bio::* in order to get them. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^3: Multiple Regex's on a Big Sequence - Benchmark by bernanke01 (Beadle) on Aug 17, 2006 at 04:47 UTC
Hi BrowserUk -- is there more to posting the results than giving the summary stats I gave at the top of the node? My benchmark didn't give me any more than that, actually. This is the text that's at the top of the grand-parent that I'm speaking of: `Regexp : 230 wallclock secs (229.86 usr + 0.06 sys = 229.92 CPU) Index : 22 wallclock secs (21.47 usr + 0.01 sys = 21.48 CPU) Merged : 663 wallclock secs (663.73 usr + 0.17 sys = 663.90 CPU)` [download]	[reply] [d/l]
Re^4: Multiple Regex's on a Big Sequence - Benchmark by BrowserUk (Patriarch) on Aug 17, 2006 at 06:06 UTC
Re: Multiple Regex's on a Big Sequence by bernanke01 (Beadle) on Aug 18, 2006 at 02:45 UTC
Here's a second round of benchmarking. I've added two things -- the suggested removal of object creation from the loop and a straight-forward regex with string studying. Here is the code, with the benchmarks at the bottom this time. :) My interpretation is that there isn't an appreciable difference caused by the object instantiation, nor by studying. I suspect the slight difference in run-time are more likely just the result of increased server load while the benchmark script was running. use strict; use Bio::SeqIO; use Benchmark; use Regexp::Assemble; use Carp; my $seqio = Bio::SeqIO->new( -file => "chrY.fa.masked", -format => "fa +sta" ); my $seq = $seqio->next_seq(); my $sequence = $seq->seq(); my @strings = ( 'CACGTG', 'GTGCAC' ); my $t0 = new Benchmark; for (1..100) { foreach my $string (@strings) { my $len = length($string); while ($sequence =~ /$string/g) { my $val = pos($seque +nce) - $len; } } } my $t1 = new Benchmark; my $t2 = new Benchmark; for (1..100) { foreach my $string (@strings) { my $pos = 0; my $len = length($string); while ( ($pos = index($sequence, $string, $pos)) >= 0 +) { my $val = substr($sequence, $pos, $len); $pos += $len; } } } my $t3 = new Benchmark; my $t4 = new Benchmark; for (1..100) { my $ra = Regexp::Assemble->new(); $ra->add(@strings); while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } } my $t5 = new Benchmark; my $t6 = new Benchmark; my $ra = Regexp::Assemble->new(); $ra->add(@strings); for (1..100) { while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } } my $t7 = new Benchmark; my $t8 = new Benchmark; for (1..100) { study($sequence); foreach my $string (@strings) { my $len = length($string); while ($sequence =~ /$string/g) { my $val = pos($seque +nce) - $len; } } } my $t9 = new Benchmark; my $td1 = timediff($t1, $t0); my $td2 = timediff($t3, $t2); my $td3 = timediff($t5, $t4); my $td4 = timediff($t7, $t6); my $td5 = timediff($t9, $t8); print 'Regexp : ', timestr($td1), "\n"; print 'Index : ', timestr($td2), "\n"; print 'Merged object inside loop : ', timestr($td3), "\n"; print 'Merged object out of loop : ', timestr($td4), "\n"; print 'Regexp with studying : ', timestr($td5), "\n"; [download] And the results: `Regexp : 221 wallclock secs (221.08 usr + 0.05 sys + = 221.13 CPU) Index : 22 wallclock secs (21.49 usr + 0.00 sys = + 21.49 CPU) Merged object inside loop : 636 wallclock secs (636.61 usr + 0.21 sys + = 636.82 CPU) Merged object out of loop : 651 wallclock secs (651.06 usr + 0.18 sys + = 651.24 CPU) Regexp with studying : 240 wallclock secs (239.50 usr + 0.27 sys + = 239.77 CPU)` [download]	[reply] [d/l] [select]


Don't ask to ask, just ask
	PerlMonks