Re: Multiple Regex's on a Big Sequence

I've done some benchmarking of the three options (not including lima1's suggestion of suffix-trees): sequential regular expressions, sequential index's, and merging the regular expressions together using Regexp::Assemble. I'm not great at writing benchmarks, so perhaps this script is simplistic, and I'd appreciate comments on how to improve it. Nevertheless, here are the results:

Regexp : 230 wallclock secs (229.86 usr +  0.06 sys = 229.92 CPU)
Index  : 22 wallclock secs (21.47 usr +  0.01 sys = 21.48 CPU)
Merged : 663 wallclock secs (663.73 usr +  0.17 sys = 663.90 CPU)
[download]

So it appears that even without forking the indexing approach is much faster for exact matches. Here's the benchmarking code I used.

use strict;
use Bio::SeqIO;
use Benchmark;
use Regexp::Assemble;
use Carp;

my $seqio = Bio::SeqIO->new(
        -file => "chrY.fa.masked",
        -format => "fasta"
        );
my $seq = $seqio->next_seq();
my $sequence = $seq->seq();

my @strings = (
        'CACGTG',
        'GTGCAC'
        );

my $t0 = new Benchmark;
for (1..100) {
        foreach my $string (@strings) {
                my $len = length($string);
                while ($sequence =~ /$string/g) { my $val = pos($seque
+nce) - $len; }
                }
        }
my $t1 = new Benchmark;

my $t2 = new Benchmark;
for (1..100) {
        foreach my $string (@strings) {
                my $pos = 0;
                my $len = length($string);
                while ( ($pos = index($sequence, $string, $pos)) >= 0 
+) {
                        my $val = substr($sequence, $pos, $len);
                        $pos += $len;
                        }
                }
        }
my $t3 = new Benchmark;

my $t4 = new Benchmark;
for (1..100) {
        my $ra = Regexp::Assemble->new();
        $ra->add(@strings);
        while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; }
        }
my $t5 = new Benchmark;

my $td1 = timediff($t1, $t0);
my $td2 = timediff($t3, $t2);
my $td3 = timediff($t5, $t4);

print 'Regexp : ', timestr($td1), "\n";
print 'Index  : ', timestr($td2), "\n";
print 'Merged : ', timestr($td3), "\n";
[download]

Comment on Re: Multiple Regex's on a Big Sequence - Benchmark Select or Download Code

Replies are listed 'Best First'.
Re^2: Multiple Regex's on a Big Sequence - Benchmark by hv (Prior) on Aug 17, 2006 at 11:21 UTC
For the cases where you compare multiple regexps against your target string, it may save time if you also `study($sequence)` before starting the matches. This will do a scan of the sequence to allow subsequent matches to use the Boyer-Moore algorithm - it builds a linked list of the locations of each different character in the sequence, and then takes advantage of the frequency data to pick the rarest character for which to walk the list. Because the main benefit of this approach is about rarity, it may not be a big win for a case like this where the string uses only a 4-character alphabet, and (presumably) uses each character roughly 1/4 of the time; I'd be interested to see how it affects the benchmarks. Hugo	[reply] [d/l]
Re^3: Multiple Regex's on a Big Sequence - Benchmark by bernanke01 (Beadle) on Aug 18, 2006 at 02:02 UTC
Great idea, I'll add it to the next round of Benchmarks.	[reply]
Re^2: Multiple Regex's on a Big Sequence - Benchmark by borisz (Canon) on Aug 17, 2006 at 10:56 UTC
That looks wrong to me, since therre is no need to constuct the `$ra` 100 times. Try it this way. Also try it with more patterns or do you search only for two? `... my $t4 = new Benchmark; my $ra = Regexp::Assemble->new(); $ra->add(@strings); for (1..100) { while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } }` [download] Boris	[reply] [d/l]
Re^3: Multiple Regex's on a Big Sequence - Benchmark by bernanke01 (Beadle) on Aug 18, 2006 at 02:01 UTC
Hey, I'll try that. However, I figured that if it were a timing test I'd have to compare 100 repeats of everything, and that included construction of the RA. Otherwise it's an apples-to-oranges comparison where one method had the chance to do up-front work and the others didn't. I'll post benchmarks on that in a few minutes though and we can see if it makes a difference.	[reply]
Re^2: Multiple Regex's on a Big Sequence - Benchmark by BrowserUk (Patriarch) on Aug 17, 2006 at 00:29 UTC
Posting the results would be good. Otherwise people would have to go through the nightmare that is installing Bio::* in order to get them. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^3: Multiple Regex's on a Big Sequence - Benchmark by bernanke01 (Beadle) on Aug 17, 2006 at 04:47 UTC
Hi BrowserUk -- is there more to posting the results than giving the summary stats I gave at the top of the node? My benchmark didn't give me any more than that, actually. This is the text that's at the top of the grand-parent that I'm speaking of: `Regexp : 230 wallclock secs (229.86 usr + 0.06 sys = 229.92 CPU) Index : 22 wallclock secs (21.47 usr + 0.01 sys = 21.48 CPU) Merged : 663 wallclock secs (663.73 usr + 0.17 sys = 663.90 CPU)` [download]	[reply] [d/l]
Re^4: Multiple Regex's on a Big Sequence - Benchmark by BrowserUk (Patriarch) on Aug 17, 2006 at 06:06 UTC
Um. Not ...er... not that I can think of. What's the emoticon for embarassment? I'm so used to seeing the results displayed after the code, I went straight to the bottom of the post. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]