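Here is some sample code. Method 1 will be slower but more accurate: it tests each sequence with its own regex, so a match for one sequence can never hide a match for another.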

Method 2 should be about the quickest basic way to do it in Perl, going on past experience. We use a regex trick: build a single m/(this|that|the other|or whatever)/g and grab all the matches on each line in a 'single' pass using list context matching. We precompile the regex and let the optimiser weave its magic. The catch is that it will miss overlapping matches.
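To make the overlap problem concrete, here is a minimal sketch of the Method 2 idea on a single line. The sequence list and the input line are made up for illustration, and I have added quotemeta, which only matters if your sequences contain regex metacharacters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample sequences and input line, for illustration only.
my @wanted = qw(AAA GGG AAAGGG);
my $line   = 'AAAGGGAAA';

# Longest first, so AAAGGG is tried before AAA at each position.
my $alt = join '|',
          map  { quotemeta($_) }
          sort { length $b <=> length $a } @wanted;

# Compile once; the capture group makes list-context m//g return the matches.
my $re = qr/($alt)/;

# Each match consumes its characters, so matches never overlap.
my %count;
$count{$_}++ for $line =~ /$re/g;

print "$_ => $count{$_}\n" for sort keys %count;
# prints:
# AAA => 1
# AAAGGG => 1
```

The leading AAAGGG swallows both the first AAA and the GGG, so neither is counted on its own. Method 1 would report AAA twice and GGG once for the same line.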

For really big files it is MUCH FASTER to use read() and pull the file in chunks of about 1MB, instead of processing it line by line. I wrote a thread on this at Re: Performance Question. In that example a simple substitution was performed on each chunk, giving a throughput of about 4MB per second, which works out to roughly 1GB every 4 minutes.
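A rough sketch of the chunked approach, applied to match counting rather than the substitution used in that other thread. The sub name count_in_chunks and its arguments are my own invention for this example. It assumes no sequence spans a newline, and it holds back the partial last line of every chunk so a match is never cut in half at a chunk boundary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_in_chunks is a hypothetical helper, not code from the other thread.
# It counts matches of $re in the file at $path, reading $chunk_size bytes
# at a time with read().
sub count_in_chunks {
    my ($path, $re, $chunk_size) = @_;
    $chunk_size ||= 1024 * 1024;    # about 1MB by default

    open my $fh, '<', $path or die "Can't open $path, Perl says $!\n";

    my %count;
    my $tail = '';    # partial last line carried over to the next chunk
    while (read $fh, my $buf, $chunk_size) {
        $buf = $tail . $buf;

        # Keep everything after the final newline for the next round,
        # so no line is ever split across two chunks.
        my $cut = rindex($buf, "\n") + 1;
        $tail = substr $buf, $cut;
        $buf  = substr $buf, 0, $cut;

        $count{$_}++ for $buf =~ /$re/g;
    }

    # Whatever follows the last newline in the file.
    $count{$_}++ for $tail =~ /$re/g;

    close $fh;
    return \%count;
}
```

You would call it as something like count_in_chunks('big.txt', $re), where $re is the precompiled alternation and big.txt stands in for your own file. I have not benchmarked this version, so treat the 4MB per second figure as applying to the substitution example only.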

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %seqs;

# slurp the file containing the sequences you want to find into a scalar
# like this
# open FILE, $finds or die "Can't open $finds, Perl says $!\n";
# do { local $/; $file = <FILE> }
# close FILE;

# simulate the file slurp result thusly
my $file = 'AAA
GGG
AAAGGG
TTTATAATA
AGA
ATA
TTT';

print "METHOD 1\n\n";

# use a hash of hashes to store compiled regexes and also count (below)
for my $seq (split "\n", $file) {
    $seqs{$seq}->{'re'} = qr/\Q$seq/;
}

# process the big file line by line (use DATA filehandle in simulation)
while (<DATA>) {
    for my $seq (keys %seqs) {
        $seqs{$seq}->{'count'}++ for m/$seqs{$seq}->{'re'}/g;
    }
}

print Dumper \%seqs;

print "\n\n\nMETHOD 2\n\n";

# re-read data, need to fix seek bug on DATA filehandle for simulation
# also clear %seqs hash....
seek DATA, 0,0;
my $bugfix;
$bugfix = <DATA> until $bugfix and $bugfix eq "__DATA__\n";
%seqs = ();

# generate a regex that searches for all the sequences
# sorted according to length to find longest possible matches
# note this method will miss overlaps (see Data::Dumper output).....
my $re = join '|', sort {length $b <=> length $a} split "\n", $file;

# compile the regex only once using qr
$re = qr/($re)/;

# process the big file line by line (use DATA filehandle in simulation)
while (<DATA>) {
    # get all the matches on each line
    $seqs{$_}++ for m/$re/g;
}

print Dumper \%seqs;

__DATA__
AAAGGGAAA
TTTATAATA
GGGTTTATA
CCCTTTCCC
UUUUUUUUU
TTTGGGATA

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print


In reply to Re: Quickest method for matching by tachyon
in thread Quickest method for matching by dr_jgbn
