in reply to Multiple Regex's on a Big Sequence

I've done some benchmarking of the three options (not including lima1's suggestion of suffix-trees): sequential regular expressions, sequential index's, and merging the regular expressions together using Regexp::Assemble. I'm not great at writing benchmarks, so perhaps this script is simplistic, and I'd appreciate comments on how to improve it. Nevertheless, here are the results:

Regexp : 230 wallclock secs (229.86 usr + 0.06 sys = 229.92 CPU) Index : 22 wallclock secs (21.47 usr + 0.01 sys = 21.48 CPU) Merged : 663 wallclock secs (663.73 usr + 0.17 sys = 663.90 CPU)

So it appears that even without forking the indexing approach is much faster for exact matches. Here's the benchmarking code I used.

use strict; use Bio::SeqIO; use Benchmark; use Regexp::Assemble; use Carp; my $seqio = Bio::SeqIO->new( -file => "chrY.fa.masked", -format => "fasta" ); my $seq = $seqio->next_seq(); my $sequence = $seq->seq(); my @strings = ( 'CACGTG', 'GTGCAC' ); my $t0 = new Benchmark; for (1..100) { foreach my $string (@strings) { my $len = length($string); while ($sequence =~ /$string/g) { my $val = pos($seque +nce) - $len; } } } my $t1 = new Benchmark; my $t2 = new Benchmark; for (1..100) { foreach my $string (@strings) { my $pos = 0; my $len = length($string); while ( ($pos = index($sequence, $string, $pos)) >= 0 +) { my $val = substr($sequence, $pos, $len); $pos += $len; } } } my $t3 = new Benchmark; my $t4 = new Benchmark; for (1..100) { my $ra = Regexp::Assemble->new(); $ra->add(@strings); while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } } my $t5 = new Benchmark; my $td1 = timediff($t1, $t0); my $td2 = timediff($t3, $t2); my $td3 = timediff($t5, $t4); print 'Regexp : ', timestr($td1), "\n"; print 'Index : ', timestr($td2), "\n"; print 'Merged : ', timestr($td3), "\n";

Replies are listed 'Best First'.
Re^2: Multiple Regex's on a Big Sequence - Benchmark
by hv (Prior) on Aug 17, 2006 at 11:21 UTC

    For the cases where you compare multiple regexps against your target string, it may save time if you also study($sequence) before starting the matches.

    This will do a scan of the sequence to allow subsequent matches to use the Boyer-Moore algorithm - it builds a linked list of the locations of each different character in the sequence, and then takes advantage of the frequency data to pick the rarest character for which to walk the list.

    Because the main benefit of this approach is about rarity, it may not be a big win for a case like this where the string uses only a 4-character alphabet, and (presumably) uses each character roughly 1/4 of the time; I'd be interested to see how it affects the benchmarks.

    Hugo

      Great idea, I'll add it to the next round of Benchmarks.
Re^2: Multiple Regex's on a Big Sequence - Benchmark
by borisz (Canon) on Aug 17, 2006 at 10:56 UTC
    That looks wrong to me, since therre is no need to constuct the $ra 100 times. Try it this way. Also try it with more patterns or do you search only for two?
    ... my $t4 = new Benchmark; my $ra = Regexp::Assemble->new(); $ra->add(@strings); for (1..100) { while ( $sequence =~ /$ra/g) { my $val = pos($sequence) - 6; } }
    Boris

      Hey, I'll try that. However, I figured that if it were a timing test I'd have to compare 100 repeats of everything, and that included construction of the RA. Otherwise it's an apples-to-oranges comparison where one method had the chance to do up-front work and the others didn't. I'll post benchmarks on that in a few minutes though and we can see if it makes a difference.

Re^2: Multiple Regex's on a Big Sequence - Benchmark
by BrowserUk (Patriarch) on Aug 17, 2006 at 00:29 UTC

    Posting the results would be good. Otherwise people would have to go through the nightmare that is installing Bio::* in order to get them.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Hi BrowserUk -- is there more to posting the results than giving the summary stats I gave at the top of the node? My benchmark didn't give me any more than that, actually. This is the text that's at the top of the grand-parent that I'm speaking of:

      Regexp : 230 wallclock secs (229.86 usr + 0.06 sys = 229.92 CPU) Index : 22 wallclock secs (21.47 usr + 0.01 sys = 21.48 CPU) Merged : 663 wallclock secs (663.73 usr + 0.17 sys = 663.90 CPU)

        Um. Not ...er... not that I can think of. What's the emoticon for embarassment?

        I'm so used to seeing the results displayed after the code, I went straight to the bottom of the post.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.