Here is some sample code. Method 1 is slower but more accurate: it runs a separate precompiled regex for each sequence, so matches of different sequences that overlap each other are all counted.
Method 2 should be about the quickest basic way to do it in Perl (based on past experience). We use a regex trick to build a single m/(this|that|the other|or whatever)/g and grab all the matches on each line in one pass using list context matching. We precompile the regex with qr and let the optimiser weave its magic. The catch is that it misses overlaps: once a match is found, the search resumes after the end of that match.
For really big files it is MUCH FASTER to use read() and process the file in chunks of about 1MB instead of line by line. I wrote about this in Re: Performance Question. In that example a simple substitution was performed on each chunk, giving a throughput of about 4MB per second, so you can process 1GB roughly every 4 minutes.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %seqs;

# slurp the file containing the sequences you want to find into a scalar
# like this
# open FILE, $finds or die "Can't open $finds, Perl says $!\n";
# do { local $/; $file = <FILE> }
# close FILE;

# simulate the file slurp result thusly
my $file = 'AAA
GGG
AAAGGG
TTTATAATA
AGA
ATA
TTT';

print "METHOD 1\n\n";

# use a hash of hashes to store compiled regexes and also count (below)
for my $seq (split "\n", $file) {
    $seqs{$seq}->{'re'} = qr/\Q$seq/;
}

# process the big file line by line (use DATA filehandle in simulation)
while (<DATA>) {
    for my $seq (keys %seqs) {
        $seqs{$seq}->{'count'}++ for m/$seqs{$seq}->{'re'}/g;
    }
}

print Dumper \%seqs;

print "\n\n\nMETHOD 2\n\n";

# re-read data, need to fix seek bug on DATA filehandle for simulation
# also clear %seqs hash....
seek DATA, 0,0;
my $bugfix;
$bugfix = <DATA> until $bugfix and $bugfix eq "__DATA__\n";
%seqs = ();

# generate a regex that searches for all the sequences
# sorted according to length to find longest possible matches
# note this method will miss overlaps (see Data::Dumper output).....
my $re = join '|', sort {length $b <=> length $a} split "\n", $file;

# compile the regex only once using qr
$re = qr/($re)/;

# process the big file line by line (use DATA filehandle in simulation)
while (<DATA>) {
    # get all the matches on each line
    $seqs{$_}++ for m/$re/g;
}

print Dumper \%seqs

__DATA__
AAAGGGAAA
TTTATAATA
GGGTTTATA
CCCTTTCCC
UUUUUUUUU
TTTGGGATA
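The chunked read() idea for really big files could look something like the rough sketch below. It is not the code from the Performance Question thread: the sub name count_in_chunks and its parameters are made up for illustration, and topping up each chunk to the end of the current line is my assumption about how to keep a sequence from being cut in half at a chunk boundary. The demo writes the same six data lines to a temp file and uses a tiny 16 byte chunk just to force several reads.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: count matches of a precompiled alternation regex
# in the file at $path, reading about $chunk_size bytes per read() call.
sub count_in_chunks {
    my ($path, $re, $chunk_size) = @_;
    $chunk_size ||= 1024 * 1024;    # about 1MB by default

    my %count;
    open my $fh, '<', $path or die "Can't open $path, Perl says $!\n";
    my $buf;
    while (read $fh, $buf, $chunk_size) {
        # Assumption: top the chunk up to the end of the current line
        # so that no line (and so no sequence) is split across chunks.
        if ($buf !~ /\n\z/) {
            my $rest = <$fh>;
            $buf .= $rest if defined $rest;
        }
        # Same list context trick as Method 2, applied to a whole chunk.
        $count{$_}++ for $buf =~ /$re/g;
    }
    close $fh;
    return \%count;
}

# Demo: same sequences and data as the simulation above.
my @find = qw(AAA GGG AAAGGG TTTATAATA AGA ATA TTT);
my $alt  = join '|', sort { length $b <=> length $a } @find;
my $demo_re = qr/($alt)/;

my ($demo_fh, $demo_file) = tempfile(UNLINK => 1);
print {$demo_fh} "$_\n"
    for qw(AAAGGGAAA TTTATAATA GGGTTTATA CCCTTTCCC UUUUUUUUU TTTGGGATA);
close $demo_fh;

my $demo = count_in_chunks($demo_file, $demo_re, 16);
print "$_ => $demo->{$_}\n" for sort keys %$demo;
```

Like Method 2 this still misses overlaps, and the counts should come out the same as the line by line version because none of the sequences can match across a newline.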
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
In reply to Re: Quickest method for matching
by tachyon
in thread Quickest method for matching
by dr_jgbn