in reply to pattern match, speed problem

Instead of iterating over the probes, build a single regex:
my $re = join '|', @probes; $re = qr{(?:$re)}; # just in case you want to use it # somewhere else, and now it's compiled.

And make sure that you use perl 5.10, which has a special optimization in the regex engine for alternations of fixed strings.

Note that this won't find overlapping match results, if you need them, you have to fiddle with pos (and perhaps \G in the regex) manually.

It will also modify the order in which the matches are returned, but if you need the original order very badly, you can find ways around it.

Replies are listed 'Best First'.
Re^2: pattern match, speed problem
by BrowserUk (Patriarch) on Feb 20, 2008 at 13:13 UTC

    It's a seductive idea, but it doesn't seem to scale. With 1000 probes and 30MB it take sunder 2 seconds. With 10,000 probes it takes 22 seconds; 11,000 - 23 secs; 12,000 - 24 secs. Everything looks good. But then with 13,000 I left it running for nearly 10 minutes and it still had not completed and the output seemed to have redcued to a trickle?

    I also tried 15,000 & 20,000 in case I had hit some odd pathelogical case, but both never seem to complete?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Did you try it with perl 5.10 or newer? (sorry, don't have one here available to test it).

      Anyway, you could join just 1000 probes into one regex, and repeat the process until all probes are tested against. That's certainly much faster than single runs.

      Update: I thougt a bit more about it, and got the following idea: if you have many probes, and use a fixed number of probes per regex, you can try to sort the probes before building the buckets. That way you can play to the strengths of the re engine and its trie optimization.

        Did you try it with perl 5.10 or newer?

        Yes, as1000/v5.10. As you can see below, the 14000 (last) run is aborted after 16 minutes elapsed and 15+ minutes of cpu. I'll leave it running while I sleep and see what happens.

        #! perl -slw use strict; use constant { CHROM => 'chromosome.txt', PROBES => 'probes.txt', }; our $N ||= 1000; my $chrom; open C, '<', CHROM or die CHROM . ": $!"; sysread C, $chrom, -s( CHROM ) or die $!; close C; $chrom = uc $chrom; open P, '<', PROBES or die $!; my @probes = <P>; close P; chomp @probes; warn "Trying $N probes with perl $]"; warn time; my $re = join '|', @probes[ 0 .. $N ]; $re = qr[($re)]; print "$1 : $-[ 0 ]" while $chrom =~ m[$re]g; warn time; __END__ [13:47:01.65]C:\test>\Perl510\bin\perl5.10.0.exe -s 668954.p10 -N=1000 + > nul Trying 1000 probes with perl 5.010000 at 668954.p10 line 21. 1203515291 at 668954.p10 line 23. 1203515295 at 668954.p10 line 29. [13:48:16.15]C:\test>\Perl510\bin\perl5.10.0.exe -s 668954.p10 -N=1200 +0 > nul Trying 12000 probes with perl 5.010000 at 668954.p10 line 21. 1203515310 at 668954.p10 line 23. 1203515335 at 668954.p10 line 29. [13:50:56.37]C:\test>\Perl510\bin\perl5.10.0.exe -s 668954.p10 -N=1300 +0 > nul Trying 13000 probes with perl 5.010000 at 668954.p10 line 21. 1203515469 at 668954.p10 line 23. 1203515494 at 668954.p10 line 29. [13:51:35.67]C:\test>\Perl510\bin\perl5.10.0.exe -s 668954.p10 -N=1400 +0 > nul Trying 14000 probes with perl 5.010000 at 668954.p10 line 21. 1203515696 at 668954.p10 line 23. Terminating on signal SIGINT(2) ### Over 15 minutes of CPU used by th +is point [14:08:03.32]C:\test>

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.