Re: Quickest method for matching

If the file of DNA sequences is very large, you could iterate through it once instead of one time for each string for which you are searching.

The idea would be to create a hash. Its keys would be the strings you are searching for and the values would be the number of times you have found that key (initialized to zero, of course.) You would also collect all of the different lengths of those keys. You'll need a buffer long enough to hold the longest key length.

Read the file in one character at a time storing the characters in your buffer. Once your buffer is as long as your longest key, you start making comparisons. For each different key length you have, you take the substring of that length which starts at the beginning of your buffer and check to see if the hash has that key. If it does, you increment its value.

After you have finished your comparisons, you lop off the first character in your buffer, read the next character in, append it to the end of your buffer, and perform your next round of comparisons. You must continue in this manner past the point where there are no more characters to read in so as not to miss shorter matches near the end of your sequence. You continue lopping characters off the front and making comparisons until your buffer is smaller than your shortest search string.

Update: With nothing much to do between two meetings this afternoon, I figured I'd throw together some actual code for this.

#!/usr/bin/perl -w
use strict;
use English;

# For pre 5.6 users.
use vars qw(%patterns @lengths $buf);

my $usage =<<END_USAGE_MESSAGE;
Usage: $0 patternfile dnafile [minmatches]
    patternfile    File containing one search string per line.
    dnafile        File containing DNA data.
    minmatches
END_USAGE_MESSAGE

die $usage if @ARGV < 2;

# check_for_matches - Finds matching patterns in our buffer.
# Accesses: %patterns, @lengths, and $buf
# Modifies: %patterns
#
sub check_for_matches {
    for (@lengths) {
        my $check = substr($buf,0,$_);
        $patterns{$check}++ if exists $patterns{$check};
    }
}


# Read in the patterns first. We require that all patterns can fit in 
+memory.

open(PATS, $ARGV[0]) or die "Couldn't open $ARGV[0]: $!";
while (<PATS>) {
    chomp;
    $patterns{$_} = 0;
}
close PATS;


# Determine pattern lengths.

my %lengths;
$lengths{length($_)}++ for keys %patterns;
@lengths      = sort {$a<=>$b} keys %lengths;
my $minlength = $lengths[0];
my $maxlength = $lengths[-1];


# Prepare to sift through the DNA data.
open(DNA, $ARGV[1]) or die "Couldn't open $ARGV[1]: $!";

# We will be reading one character at a time.
$INPUT_RECORD_SEPARATOR = \1;

# Read data, check for matches, resize our buffer.
while (my $c = <DNA>) {
    next if $c eq "\n";
    $buf .= $c;
    next if length($buf) < $maxlength;
    check_for_matches();
    $buf = substr($buf,1);
}

# We need to continue checking for matches near the end.
while ($buf = substr($buf,1) and length($buf) >= $minlength) {
    check_for_matches();
}

# Finally, we output the patterns which matched more than once.
for (sort keys %patterns) {
    print "$_\n" if $patterns{$_} > 1;
}
[download]

A test with a 1MB file of random characters from the set [CGAT] and a pattern file containing 5000 random patterns (chosen from the same set of characters) of random lengths between 1 and 40 characters took 10 minutes and 14 seconds of real time on an old Sun Ultra Enterprise 3000 running at 168 MHz. The same test on a 3500 running at 400 MHz (under moderate load) took about 4 and a half minutes.

-sauoq
"My two cents aren't worth a dime.";

Comment on Re: Quickest method for matching Download Code