Re: Quickest method for matching
by RMGir (Prior) on Aug 06, 2002 at 19:59 UTC
(the previous thread he mentions is "Pattern Matching", not using regex)
How many strings are you searching for? Is it just 2 or 3, or is it 200-300?
Can the match span lines in the big file? Like this, for instance:
...............AAAAAAAA"ATGGCTC
GTGTCCA"AAAAAAAAAAA ...........
That would obviously complicate things...
If you only have a manageable number of strings to match, and a match can't occur across lines, I'd suggest something like this. (It does use regexes, but when matching multiple strings at once I don't know of an easy way to avoid them.)
use Regex::PreSuf;

sub superMatch {
    my ($patternFile, $dataFile, $outFile) = @_;

    open(PAT, "<$patternFile") or die "Can't open $patternFile, error $!";
    my @patterns = <PAT>;
    chomp @patterns;
    close(PAT);

    open(OUT, ">>$outFile") or die "Can't open output file $outFile, error $!";

    # Regex::PreSuf generates a regex that will match all
    # of the patterns much more quickly than a naive
    # join "|", @patterns will
    my $re = presuf(@patterns);

    open(DATA, "<$dataFile") or die "Can't open $dataFile, error $!";

    # NO NEED TO READ INTO MEMORY ALL AT ONCE!
    while (<DATA>) {
        # only compile regex once
        if (/$re/o) {
            # only chomp if we have a match
            chomp;
            # capturing matches are slower, so only capture
            # if one or more matches are present.
            # might have more than one match in a line!
            while (/($re)/og) {
                print OUT "'$1', '$_'\n";
            }
        }
    }
    close(DATA);
    close(OUT);
}
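Called with some placeholder file names (these are just examples, not files from the original post), usage would look something like:

superMatch( 'patterns.txt', 'sequences.dna', 'matches.out' );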
I haven't tested this, but it could be a start for you...
--
Mike
If you are on a *nix machine, check the manual page for the time command. It's cleaner and easier for timing a script than changing the script itself.
-sauoq
"My two cents aren't worth a dime.";
my $begin = time();
# do stuff
print "I took ", time() - $begin, " seconds\n";
Otherwise, see Benchmark made easy for timing faster operations.
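For example, a rough and untested cmpthese comparison of index versus a regex match might look like this (the sample string and sequence are made up purely for illustration):

use Benchmark qw(cmpthese);

my $line = 'ATG' x 10_000;    # made-up sample data
my $seq  = 'TGA';

cmpthese( -3, {
    regex => sub { my $count = () = $line =~ /\Q$seq/g },
    index => sub {
        my ( $count, $pos ) = ( 0, 0 );
        $count++, $pos++ while ( $pos = index( $line, $seq, $pos ) ) >= 0;
    },
} );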
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
Re: Quickest method for matching
by sauoq (Abbot) on Aug 06, 2002 at 20:09 UTC
If the file of DNA sequences is very large, you could iterate through it once instead of one time for each string for which you are searching.
The idea would be to create a hash. Its keys would be the strings you are searching for, and the values would be the number of times you have found that key (initialized to zero, of course). You would also collect all of the different lengths of those keys. You'll need a buffer long enough to hold your longest key.
Read the file in one character at a time, storing the characters in your buffer. Once your buffer is as long as your longest key, you start making comparisons. For each different key length you have, you take the substring of that length which starts at the beginning of your buffer and check to see if the hash has that key. If it does, you increment its value.
After you have finished your comparisons, you lop off the first character in your buffer, read the next character in, append it to the end of your buffer, and perform your next round of comparisons. You must continue in this manner past the point where there are no more characters to read in so as not to miss shorter matches near the end of your sequence. You continue lopping characters off the front and making comparisons until your buffer is smaller than your shortest search string.
Update: With nothing much to do between two meetings this afternoon, I figured I'd throw together some actual code for this.
#!/usr/bin/perl -w
use strict;
use English;

# For pre 5.6 users.
use vars qw(%patterns @lengths $buf);

my $usage =<<END_USAGE_MESSAGE;
Usage: $0 patternfile dnafile [minmatches]
    patternfile    File containing one search string per line.
    dnafile        File containing DNA data.
    minmatches
END_USAGE_MESSAGE

die $usage if @ARGV < 2;

# check_for_matches - Finds matching patterns in our buffer.
#   Accesses: %patterns, @lengths, and $buf
#   Modifies: %patterns
#
sub check_for_matches {
    for (@lengths) {
        my $check = substr($buf, 0, $_);
        $patterns{$check}++ if exists $patterns{$check};
    }
}

# Read in the patterns first. We require that all patterns can fit in memory.
open(PATS, $ARGV[0]) or die "Couldn't open $ARGV[0]: $!";
while (<PATS>) {
    chomp;
    $patterns{$_} = 0;
}
close PATS;

# Determine pattern lengths.
my %lengths;
$lengths{length($_)}++ for keys %patterns;
@lengths = sort { $a <=> $b } keys %lengths;
my $minlength = $lengths[0];
my $maxlength = $lengths[-1];

# Prepare to sift through the DNA data.
open(DNA, $ARGV[1]) or die "Couldn't open $ARGV[1]: $!";

# We will be reading one character at a time.
$INPUT_RECORD_SEPARATOR = \1;

# Read data, check for matches, resize our buffer.
while (my $c = <DNA>) {
    next if $c eq "\n";
    $buf .= $c;
    next if length($buf) < $maxlength;
    check_for_matches();
    $buf = substr($buf, 1);
}

# We need to continue checking for matches near the end.
while ($buf = substr($buf, 1) and length($buf) >= $minlength) {
    check_for_matches();
}

# Finally, we output the patterns which matched more than once.
for (sort keys %patterns) {
    print "$_\n" if $patterns{$_} > 1;
}
A test with a 1MB file of random characters from the set [CGAT] and a pattern file containing 5000 random patterns (chosen from the same set of characters) of random lengths between 1 and 40 characters took 10 minutes and 14 seconds of real time on an old Sun Ultra Enterprise 3000 running at 168 MHz. The same test on a 3500 running at 400 MHz (under moderate load) took about 4 and a half minutes.
-sauoq
"My two cents aren't worth a dime.";
Hello sauoq,
I appreciate your helpful comments, since I am not that savvy a programmer. What led me to ask for help is that I have to run several million short sequences, with lengths between 20 and 30, against the parent sequence file (the larger file). I am also in the middle of running about 100,000 of these short sequences through my basic program, and it is still not finished after 3 weeks. So I feel compelled to improve things, otherwise my research on this topic will never get done. Appreciate your help.
Dr.J
Re: Quickest method for matching
by danboo (Beadle) on Aug 06, 2002 at 19:41 UTC
If you find regexes slow and your searching is based on simple substrings, I suggest you use the built-in index instead.
As for the suggestion of using hashes, I'm not sure what the other person was suggesting, but I can't imagine they'd be faster than arrays for this. Assuming your files are so large that slurping them into an array is slow, I'd suggest not loading the file all at once, but instead iterating line by line and testing as you read. If you find your 2 matches, seek back to the beginning of the file and go on to the next substring.
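As a rough, untested sketch of counting occurrences of one substring with index ($line, $substring, and $count are just placeholder names):

my $pos = index( $line, $substring );
while ( $pos >= 0 ) {
    $count++;
    $pos = index( $line, $substring, $pos + 1 );
}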
- danboo
Re: Quickest method for matching
by tachyon (Chancellor) on Aug 06, 2002 at 22:15 UTC
Here is some sample code. Method 1 will be slower but more accurate.
Method 2 should be the quickest possible way (basic-method wise) to do it in Perl, based on past experience. We use a regex trick to build an m/(this|that|the other|or whatever)/g and grab all the matches on each line in a 'single' pass using list-context matching. We precompile the regex and let the optimiser weave its magic... Note that we will miss overlapping matches.
For really big files it is MUCH FASTER to use read() and process the data in roughly 1MB chunks instead of going line by line. I wrote a thread on this at Re: Performance Question here. In that example a simple substitution was performed on each chunk, giving a throughput of 4MB per second, which lets you process roughly 1GB every 4 minutes.
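A rough, untested sketch of that chunked approach follows; $bigfile, $re (a compiled regex with one capture group, as in Method 2 below), and %count are placeholders, and it assumes a match never spans a chunk boundary (to be safe you would carry the tail of each chunk over into the next):

open my $fh, '<', $bigfile or die "Can't open $bigfile, Perl says $!\n";
my $chunk;
while ( read( $fh, $chunk, 1024 * 1024 ) ) {
    # count every hit in this 1MB chunk
    $count{$1}++ while $chunk =~ m/$re/g;
}
close $fh;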
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my %seqs;

# slurp the file containing the sequences you want to find into a scalar
# like this
# open FILE, $finds or die "Can't open $finds, Perl says $!\n";
# do { local $/; $file = <FILE> };
# close FILE;

# simulate the file slurp result thusly
my $file =
'AAA
GGG
AAAGGG
TTTATAATA
AGA
ATA
TTT';

print "METHOD 1\n\n";

# use a hash of hashes to store compiled regexes and also count (below)
for my $seq (split "\n", $file) {
    $seqs{$seq}->{'re'} = qr/\Q$seq/;
}

# process the big file line by line (use DATA filehandle in simulation)
while (<DATA>) {
    for my $seq (keys %seqs) {
        $seqs{$seq}->{'count'}++ for m/$seqs{$seq}->{'re'}/g;
    }
}

print Dumper \%seqs;

print "\n\n\nMETHOD 2\n\n";

# re-read data, need to fix seek bug on DATA filehandle for simulation
# also clear %seqs hash....
seek DATA, 0, 0;
my $bugfix;
$bugfix = <DATA> until $bugfix and $bugfix eq "__DATA__\n";
%seqs = ();

# generate a regex that searches for all the sequences
# sorted according to length to find longest possible matches
# note this method will miss overlaps (see Data::Dumper output).....
my $re = join '|', sort { length $b <=> length $a } split "\n", $file;

# compile the regex only once using qr
$re = qr/($re)/;

# process the big file line by line (use DATA filehandle in simulation)
while (<DATA>) {
    # get all the matches on each line
    $seqs{$_}++ for m/$re/g;
}

print Dumper \%seqs;

__DATA__
AAAGGGAAA
TTTATAATA
GGGTTTATA
CCCTTTCCC
UUUUUUUUU
TTTGGGATA
cheers
tachyon
s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print
To all who have offered a suggestion or answer: I am in awe!! I am novice-ish, and processing this information will take me some time, I imagine... otherwise I would give some more interesting feedback. Cheers, Dr.J
I've got a few nits to pick with your first method. I'd pick different nits with your second method but, as you point out, your second method doesn't handle overlaps and, because of that, probably doesn't fit the requirements. So I'll focus on your first one.
while (<DATA>) {
    for my $seq (keys %seqs) {
        $seqs{$seq}->{'count'}++ for m/$seqs{$seq}->{'re'}/g;
    }
}
What if the file is one 100MB line? (I'm not a biologist or anything, but I am under the impression that strands of DNA are very long.) If the data you are searching through is 100 megs, you've just gone through 100 megs N times, where N is the number of patterns you are searching for. Additionally, optimizing your regex buys you nothing if you are only searching for the pattern once in one long string of input.
Ok, maybe I'm wrong about the number of lines. He said it was a "very large file" but he didn't specify much more than that. So, is your method better if you have a file with many shorter lines? Not very. Compiling the regex patterns ahead of time buys you a little. But consider 100 lines, each 1MB long. In that case, you search each of those lines N times. In other words, you again search through all 100MB N times.
I don't propose that the solution I posted above at Re: Quickest method for matching is at all the best but I suspect it would be noticeably more efficient than your method with real input.
Of course, there are many optimizations that could be used to improve my code too. I could read in a larger buffer at once, thereby reducing the number of reads. I could stop looking for patterns once they've already been found twice. I could refrain from searching across newline boundaries if I expected my DNA data to be composed of many lines. (Notice that newlines don't break the algorithm, they just create a bit of useless work.) Or I could code it up in C because, in this case, Perl is really only buying me a little convenience.
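As an untested sketch of the "larger buffer" idea (it swaps the one-character reads in my code above for 4KB record reads and then slides over each block in memory, so it mainly cuts down on the number of read calls rather than the per-character work):

$INPUT_RECORD_SEPARATOR = \4096;
while ( my $block = <DNA> ) {
    $block =~ tr/\n//d;    # newlines only create useless work anyway
    for my $c ( split //, $block ) {
        $buf .= $c;
        next if length($buf) < $maxlength;
        check_for_matches();
        $buf = substr( $buf, 1 );
    }
}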
-sauoq
"My two cents aren't worth a dime.";
Hey Guys,
To be more specific on the "very large file" containing the DNA sequences...there are 27000 sequences and it looks something like this:
>identifier 1
ATG...(50ish to 4000 base pairs long)
>identifier 2
ATG...etc.
>identifier 3
Etc....
Actually the input file is the larger of the two... it contains much smaller chunks of DNA in the same format, probably 200,000 sequences per file. I have several of these files that I will search against the other "large file". Sorry for the lack of detail; I never know how much to give without boring people with useless details.
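(For reference, a minimal, untested sketch of reading a file in that format into a hash of identifier => sequence; 'large_file.txt' is just a placeholder name, and it assumes every record is a ">identifier" line followed by its sequence lines:

my ( %seq, $id );
open my $fh, '<', 'large_file.txt' or die "Can't open large_file.txt: $!";
while (<$fh>) {
    chomp;
    if (/^>(.+)/) {
        $id = $1;            # start of a new record
    }
    elsif ( defined $id ) {
        $seq{$id} .= $_;     # append this line to the current sequence
    }
}
close $fh;
)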
Cheers,
Dr.J