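Here is some sample code. Method 1 will be slower but more accurate: it tests each sequence separately, so a hit for one sequence is never hidden by an overlapping hit for another.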

Method 2 should be about the quickest basic way to do it in Perl (based on past experience). We use a regex trick to build an m/(this|that|the other|or whatever)/g and grab all the matches on each line in a single pass using list context matching. We precompile the regex and let the optimiser weave its magic. This method will miss overlaps.
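
To make the overlap caveat concrete, here is a minimal sketch that is not part of the script below. The sub name count_overlapping is made up for illustration. It swaps in a different trick: wrapping the capturing alternation in a zero-width lookahead, so the regex engine retries at every position and overlapping hits are seen. It still reports only the first alternative that matches at any one position, so it is a partial fix at best.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): count matches,
# including overlapping ones, by capturing inside a zero-width lookahead.
sub count_overlapping {
    my ($line, @seqs) = @_;
    # longest first, and quoted so the sequences are treated literally
    my $alt = join '|', map { quotemeta($_) }
              sort { length $b <=> length $a } @seqs;
    my $re = qr/(?=($alt))/;
    my %count;
    # the lookahead consumes nothing, so every start position is tried
    $count{$_}++ for $line =~ /$re/g;
    return \%count;
}

my $hits = count_overlapping('TTTATAATA', qw(TTT ATA TTTATAATA));
print "$_ => $hits->{$_}\n" for sort keys %$hits;
```

On this input the plain alternation would find only the single 9 character match, while the lookahead version also sees both ATA hits inside it. TTT still goes uncounted because the longer sequence wins at position 0, which is why Method 1 remains the accurate one.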

For really big files it is MUCH FASTER to use read() to pull in chunks of about 1MB and process those, instead of going line by line. I wrote about this in the thread Re: Performance Question. In that example a simple substitution was performed on each chunk, giving a throughput of 4MB per second, which lets you process 1GB roughly every 4 minutes.
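
A rough sketch of the chunked approach, under some assumptions: the sub name count_in_chunks, the tiny temp file and the 8 byte chunk size in the demo are all invented here for illustration, and this is not the code from the other thread. Each chunk is topped up to the end of the current line so that a sequence is never split across two chunks.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: count matches of $re in the file at $path, reading
# roughly $chunk_size bytes at a time instead of one line at a time.
sub count_in_chunks {
    my ($path, $re, $chunk_size) = @_;
    $chunk_size ||= 1024 * 1024;    # default to about 1MB
    my %count;
    open my $fh, '<', $path or die "Can't open $path, Perl says $!\n";
    while (read $fh, my $buf, $chunk_size) {
        # a chunk can end mid-line, so append the rest of that line
        my $rest = <$fh>;
        $buf .= $rest if defined $rest;
        $count{$_}++ for $buf =~ /$re/g;
    }
    close $fh;
    return \%count;
}

# simulate a big file with a small temp file and a tiny chunk size
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "AAAGGGAAA\nTTTATAATA\nGGGTTTATA\n";
close $tmp_fh;

my $chunk_hits = count_in_chunks($tmp_name, qr/(AAAGGG|AAA|GGG|TTT|ATA)/, 8);
print "$_ => $chunk_hits->{$_}\n" for sort keys %$chunk_hits;
```

The counts should come out the same as a line by line pass over the same data, since the sequences never contain a newline. Mixing read() with <$fh> is fine because both are buffered; sysread() would not be safe to mix this way.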

#!/usr/bin/perl

use strict;
use warnings;
use Data::Dumper;

my %seqs;

# slurp the file containing the sequences you want to find into a scalar
# like this
# open FILE, $finds or die "Can't open $finds, Perl says $!\n";
# do { local $/; $file = <FILE> };
# close FILE;

# simulate the file slurp result thusly
my $file =
'AAA
GGG
AAAGGG
TTTATAATA
AGA
ATA
TTT';

print "METHOD 1\n\n";
# use a hash of hashes to store compiled regexes and also count (below)
for my $seq (split "\n", $file) {
    $seqs{$seq}->{'re'} = qr/\Q$seq/;
}

# process the big file line by line (use DATA filehandle in simulation)
while (<DATA>) {
    for my $seq (keys %seqs) {
        $seqs{$seq}->{'count'}++ for m/$seqs{$seq}->{'re'}/g;
    }
}

print Dumper \%seqs;

print "\n\n\nMETHOD 2\n\n";

# re-read data, need to fix seek bug on DATA filehandle for simulation
# also clear %seqs hash....
seek DATA, 0,0;
my $bugfix;
$bugfix = <DATA> until $bugfix and $bugfix eq "__DATA__\n";
%seqs = ();

# generate a regex that searches for all the sequences
# sorted according to length to find longest possible matches
# note this method will miss overlaps (see Data::Dumper output).....
my $re = join '|', map { quotemeta($_) } sort {length $b <=> length $a} split "\n", $file;
# compile the regex only once using qr
$re = qr/($re)/;

# process the big file line by line (use DATA filehandle in simulation)
while (<DATA>) {
    # get all the matches on each line
    $seqs{$_}++ for m/$re/g;
}

print Dumper \%seqs;

__DATA__
AAAGGGAAA
TTTATAATA
GGGTTTATA
CCCTTTCCC
UUUUUUUUU
TTTGGGATA

cheers

tachyon

s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

In reply to Re: Quickest method for matching by tachyon
in thread Quickest method for matching by dr_jgbn
