in reply to Quickest method for matching

Here is some sample code. Method 1 will be slower but more accurate.

Method 2 should be the quickest possible way (basic-method-wise) to do it in Perl, based on past experience. We use a regex trick: build a single alternation like m/(this|that|the other|or whatever)/g and grab all the matches on each line in a 'single' pass using list-context matching. We precompile the regex with qr// and let the optimiser weave its magic.... Note that this method will miss overlapping matches.

For really big files it is MUCH FASTER to use read() and process the data in roughly 1MB chunks instead of going line by line. I wrote a thread on this at Re: Performance Question. In that example a simple substitution was performed on each chunk, giving a throughput of about 4MB per second, which lets you process 1GB roughly every 4 minutes.
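A rough sketch of what I mean (this is not the code from that thread; the filename, chunk size, overlap handling and example regex below are only placeholders):

#!/usr/bin/perl
use strict;
use warnings;

# Read the big file in ~1MB chunks with read() and run a Method 2 style
# alternation over each chunk. A small tail is carried between chunks so
# matches that straddle a chunk boundary are not lost.
my $re      = qr/(AAAGGG|TTTATAATA)/;   # example alternation only
my $size    = 1024 * 1024;              # ~1MB per read()
my $overlap = 64;                       # must be >= the longest sequence

open my $fh, '<', 'bigfile.dna' or die "Can't open bigfile.dna: $!\n";

my %count;
my $tail = '';
while (read $fh, my $buf, $size) {
    $buf = $tail . $buf;
    my $limit = length($buf) - $overlap;
    $limit = 0 if $limit < 0;
    while ($buf =~ /$re/g) {
        # matches starting inside the tail are left for the next pass
        last if $-[0] >= $limit;
        $count{$1}++;
    }
    $tail = substr $buf, $limit;        # carry the tail forward
}
$count{$1}++ while $tail =~ /$re/g;     # scan whatever is left at EOF
close $fh;

printf "%s => %d\n", $_, $count{$_} for sort keys %count;

And here is the sample code for Methods 1 and 2: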

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my %seqs;

# slurp the file containing the sequences you want to find into a scalar
# like this
# open FILE, $finds or die "Can't open $finds, Perl says $!\n";
# do { local $/; $file = <FILE> }
# close FILE;

# simulate the file slurp result thusly
my $file = 'AAA
GGG
AAAGGG
TTTATAATA
AGA
ATA
TTT';

print "METHOD 1\n\n";

# use a hash of hashes to store compiled regexes and also count (below)
for my $seq (split "\n", $file) {
    $seqs{$seq}->{'re'} = qr/\Q$seq/;
}

# process the big file line by line (use DATA filehandle in simulation)
while (<DATA>) {
    for my $seq (keys %seqs) {
        $seqs{$seq}->{'count'}++ for m/$seqs{$seq}->{'re'}/g;
    }
}

print Dumper \%seqs;

print "\n\n\nMETHOD 2\n\n";

# re-read data, need to fix seek bug on DATA filehandle for simulation
# also clear %seqs hash....
seek DATA, 0, 0;
my $bugfix;
$bugfix = <DATA> until $bugfix and $bugfix eq "__DATA__\n";
%seqs = ();

# generate a regex that searches for all the sequences
# sorted according to length to find longest possible matches
# note this method will miss overlaps (see Data::Dumper output).....
my $re = join '|', sort { length $b <=> length $a } split "\n", $file;

# compile the regex only once using qr
$re = qr/($re)/;

# process the big file line by line (use DATA filehandle in simulation)
while (<DATA>) {
    # get all the matches on each line
    $seqs{$_}++ for m/$re/g;
}

print Dumper \%seqs;

__DATA__
AAAGGGAAA
TTTATAATA
GGGTTTATA
CCCTTTCCC
UUUUUUUUU
TTTGGGATA

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Re: Quickest method for matching
by Anonymous Monk on Aug 07, 2002 at 04:40 UTC
    To All who have offered a suggestion/answer...I am in AWE!!
    I am novice-ish, and processing this information will take me some time, I imagine... otherwise I would give some more interesting feedback.

    Cheers

    Dr.J

Re: Re: Quickest method for matching
by sauoq (Abbot) on Aug 07, 2002 at 07:46 UTC
    I've got a few nits to pick with your first method. I'd pick different nits with your second method but, as you point out, your second method doesn't handle overlaps and, because of that, probably doesn't fit the requirements. So I'll focus on your first one.
    while (<DATA>) {
        for my $seq (keys %seqs) {
            $seqs{$seq}->{'count'}++ for m/$seqs{$seq}->{'re'}/g;
        }
    }

    What if the file is one 100MB line? (I'm not a biologist or anything but I am under the impression that strands of DNA are very long.) If the data you are searching through is 100MB, you've just gone through 100MB N times, where N is the number of patterns you are searching for. Additionally, precompiling your regexes buys you nothing if each one is only applied once to one long string of input.

    Ok, maybe I'm wrong about the number of lines. He said it was a "very large file" but he didn't specify much more than that. So, is your method better if you have a file with many shorter lines? Not very. Compiling the regex patterns ahead of time buys you a little. But consider 100 lines, each 1MB long. In that case, you search each of those lines N times. In other words, you again search through all 100MB N times.

    I don't propose that the solution I posted above at Re: Quickest method for matching is at all the best but I suspect it would be noticeably more efficient than your method with real input.

    Of course, there are many optimizations that could be used to improve my code too. I could read in a larger buffer at once, thereby reducing the number of reads. I could stop looking for patterns once they've already been found twice. I could refrain from searching across newline boundaries if I expected my DNA data to be composed of many lines. (Notice that newlines don't break the algorithm, they just create a bit of useless work.) Or I could code it up in C because, in this case, Perl is really only buying me a little convenience.
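    For instance, the "found twice" shortcut could look roughly like this (not my posted code, just an index()-based illustration with a made-up filename and example patterns):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Count occurrences of each pattern, but stop caring about a pattern
    # once it has been seen twice, and stop reading once every pattern
    # has been seen twice.
    my @patterns = qw(AAAGGG TTTATAATA GGGTTTATA);   # example patterns
    my %count;

    open my $fh, '<', 'bigfile.dna' or die "Can't open bigfile.dna: $!\n";
    while (my $line = <$fh>) {
        for my $pat (@patterns) {
            next if ($count{$pat} || 0) >= 2;         # already found twice
            my $pos = 0;
            while (($pos = index($line, $pat, $pos)) >= 0) {
                last if ++$count{$pat} >= 2;
                $pos++;                               # step by one so overlaps count
            }
        }
        last unless grep { ($count{$_} || 0) < 2 } @patterns;
    }
    close $fh;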

    -sauoq
    "My two cents aren't worth a dime.";
    
      Hey Guys,

      To be more specific on the "very large file" containing the DNA sequences... there are 27,000 sequences and it looks something like this:
      >identifier 1
      ATG...(50ish to 4000 base pairs long)
      >identifier 2
      ATG...etc.
      >identifier 3
      Etc....

      Actually the input file is the larger of the two... it contains much smaller chunks of DNA in the same format. Across these files there are probably 200,000 sequences.
      I have several of these files that I will search against the other "large file".
      Sorry for the lack of detail. I never know how much to give without boring people with useless details.

      Cheers,

      Dr.J

        Of course this makes a great deal of difference.

        If you have more data to search for than data to search through, don't use the method I suggested.

        Given the narrow scope of your problem, there are probably a lot of optimizations you could make.

        • How often will you need to search for the same 200,000 sequences in different input?
        • How often will you need to perform a search on the same 27,000 larger sequences?
        • Do you know the probability characteristics of your search? Can you expect many hits or very few?
        • What about overlaps?
        • Are any of your shorter search sequences present in any of your larger ones?
        • Is character data the best representation? (As you only use four characters, you might want to look into using bit strings or something instead; see the sketch just below.)
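        A tiny sketch of that last point (the base-to-bits mapping and the sub name are made up for illustration, not anything posted in this thread):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Pack each base into 2 bits with vec(); four bases per byte.
        my %code = (A => 0, C => 1, G => 2, T => 3);

        sub pack_dna {
            my ($seq) = @_;
            my $bits  = '';
            my $i     = 0;
            vec($bits, $i++, 2) = $code{$_} for split //, uc $seq;
            return $bits;
        }

        my $packed = pack_dna('AAAGGGTTT');
        printf "%d bases packed into %d bytes\n", 9, length $packed;   # 9 bases -> 3 bytes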

        If you don't expect to be doing this kind of search often, you might be better off just brute forcing it than trying to optimize it too much.
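        Brute forcing it might look roughly like this (only a sketch: the filenames are made up, and it assumes the ">identifier" layout you described, with the sequence on the line(s) following each identifier; overlapping hits are counted by stepping index() one character at a time):

        #!/usr/bin/perl
        use strict;
        use warnings;

        # read a FASTA-style file into identifier => sequence pairs
        sub read_seqs {
            my ($path) = @_;
            open my $fh, '<', $path or die "Can't open $path: $!\n";
            my (%seq, $id);
            while (<$fh>) {
                chomp;
                if (/^>(.*)/)       { $id = $1 }
                elsif (defined $id) { $seq{$id} .= $_ }
            }
            close $fh;
            return \%seq;
        }

        my $patterns = read_seqs('small_chunks.fasta');   # the ~200,000 short sequences
        my $targets  = read_seqs('large_file.fasta');     # the ~27,000 longer sequences

        # count every occurrence of every pattern in every target with index()
        my %count;
        for my $tid (keys %$targets) {
            my $dna = $targets->{$tid};
            for my $pid (keys %$patterns) {
                my $pos = 0;
                while (($pos = index($dna, $patterns->{$pid}, $pos)) >= 0) {
                    $count{$pid}++;
                    $pos++;            # step by one so overlapping hits count
                }
            }
        }

        print "$_ => $count{$_}\n" for sort keys %count;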

        Good Luck!

        -sauoq
        "My two cents aren't worth a dime.";