in reply to Quickest method for matching

(the previous thread he mentions is "Pattern Matching", not using regex)

How many strings are you searching for? Is it just 2 or 3, or is it 200-300?

Can the match span lines in the big file? Like this, for instance:

...............AAAAAAAA"ATGGCTC
GTGTCCA"AAAAAAAAAAA ...........
That would obviously complicate things...

If you only have a manageable number of strings to match, and a match can't occur across lines, I'd suggest something like this (which does use regexes, but if you have multiple strings to match at once, I don't know how to avoid them easily):

use Regex::PreSuf;

sub superMatch {
    my ($patternFile, $dataFile, $outFile) = @_;

    open(PAT, "<$patternFile") or die "Can't open $patternFile, error $!";
    my @patterns = <PAT>;
    chomp @patterns;
    close(PAT);

    open(OUT, ">>$outFile") or die "Can't open output file $outFile, error $!";

    # Regex::PreSuf generates a regex that will match all
    # of the patterns much more quickly than a naive
    # join "|",@patterns will
    my $re = presuf(@patterns);

    open(DATA, "<$dataFile") or die "Can't open $dataFile, error $!";

    # NO NEED TO READ INTO MEMORY ALL AT ONCE!
    while (<DATA>) {
        # only compile regex once
        if (/$re/o) {
            # only chomp if we have a match
            chomp;
            # capturing matches are slower, so only capture
            # if one or more matches are present.
            # might have more than one match in a line!
            while (/($re)/og) {
                # the line was chomped, so end each record explicitly
                print OUT "'$1', '$_'\n";
            }
        }
    }

    close(DATA);
    close(OUT);
}
I haven't tested this, but it could be a start for you...
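
For comparison, here is a minimal sketch of matching several fixed strings without any regex, using Perl's built-in index(). The helper name find_all and the sample sequence are made up for illustration and are not part of the code above. It makes one pass over the line per pattern, so with thousands of patterns it may well be slower than a single combined regex; that would need to be measured.

```perl
use strict;
use warnings;

# Hypothetical helper: return [pattern, offset] pairs for every
# occurrence of any of @patterns in $line, using index() only.
sub find_all {
    my ($line, @patterns) = @_;
    my @hits;
    for my $pat (@patterns) {
        next if $pat eq '';    # index() with an empty pattern matches everywhere
        my $pos = 0;
        while (($pos = index($line, $pat, $pos)) >= 0) {
            push @hits, [$pat, $pos];
            $pos++;            # step by one so overlapping hits are found too
        }
    }
    return @hits;
}

# Made-up sample data
my @hits = find_all('AAATGGCTCAAGTGTCCAATGGCTC', 'ATGGCTC', 'GTGTCCA');
print "'$_->[0]' at offset $_->[1]\n" for @hits;
```

Hits come back grouped by pattern rather than in the order they appear in the line; sort on the offset if that matters.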
--
Mike

Re: Re: Quickest method for matching
by dr_jgbn (Beadle) on Aug 06, 2002 at 21:45 UTC
    Hey Mike,
    Thanks for your reply. Not every match will be the same, but there can be 2 to 2000 (or even more) matches.
    Also, a match will not occur over 2 lines (that was a good question I had not thought of).

    I will have to test your script tonight. Do you know offhand how to time a script?

    Thanks again,
    Dr.J

      If you are on a *nix machine, check the manual page for the time command. It's cleaner and easier for timing a script than changing the script itself.
      -sauoq
      "My two cents aren't worth a dime.";
      

      If it is a slow script:

      my $begin = time();
      # do stuff
      print "I took ", time() - $begin, " seconds\n";

      Otherwise see Benchmark made easy for timing faster operations
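
For the faster case, a minimal sketch using the core Benchmark and Time::HiRes modules might look like the following. The count_matches sub and the sample data are stand-ins invented for the example, not the actual matching script, and the run count of 50 is arbitrary.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);
use Time::HiRes qw(time);    # replaces time() with a sub-second version

# Hypothetical workload standing in for the real matching code.
sub count_matches {
    my ($re, $lines) = @_;
    my $n = 0;
    for my $line (@$lines) {
        $n++ while $line =~ /$re/g;
    }
    return $n;
}

# Made-up data: 3000 short lines, two of every three contain a match.
my @lines = ('AAAATGGCTCAAAA', 'CCCCGTGTCCACCCC', 'TTTTTTTT') x 1000;
my $re    = qr/ATGGCTC|GTGTCCA/;

# Wall-clock timing of a single run, with fractional seconds.
my $begin = time();
my $found = count_matches($re, \@lines);
printf "found %d matches in %.4f seconds\n", $found, time() - $begin;

# Benchmark runs the code 50 times and reports the time used.
my $t = timeit(50, sub { count_matches($re, \@lines) });
print "50 runs: ", timestr($t), "\n";
```

Repeating the run many times smooths out noise that a single time() difference cannot, which is the main reason to reach for Benchmark on short operations.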

      cheers

      tachyon

      s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print