Sorry 'nall, but that is a ludicrous suggestion.
On a file of around 1GB of randomly generated DNA data:
    C:\test>dir randDNA.txt
    10/06/2012 01:13 1,036,000,000 randDNA.txt

    C:\test>head randDNA.txt
    ATAGTAAGGGACCTCAGCGAGTGTCATAAATATAAGTTGTCGAGAGGTAACGATAGACGCCAATCACTTTTA
    GCATCCAGGGGGTCAGTGTCCTAGCGACGTGGAACAACGACTACGCTTCGTAGGTCTCACCGTATAGATGCC
    CGGGAGGCCTGCAAAGGAGTGAAGGGTAACGCCTGAACCCTTTGGCCTATCTACGTCGAGATTTCTACCGGA
    GCGGAGATCTCCCCCCGGATTTCGTCAAATTCTGGAAATAAGTGTAGCAACCGAACGGTATAGCCAGATAAT
    GCTCGAGCACACGCGGACGGTCTCAGAAACTAATTTTCTTAAGCTGGAACAGGCAACCAAAGATTTTAGATT
    ATCGGACGTAGCCAGAAGTGCGGATTTACAGCAACGCCTTTCTCAAAAGTTGCCGTCCCGCGGCACTAATAC
    ACCGATATGAAGGCGCTGAAACGATTATGTGTAGTGACGTGCCTTTCAGCGGCTATGGACGCTATCCCCGCA
    GTCATGAGTCCAATTTGGGGTTAGCTGAAATAACCTGCTGTCCCCTAAAATTGTCGCATTCAAGCAGGGTGG
    CGGGTACACATGCTAGCATCCGGACGCTATAAGGGCTCCCTTAGTAACATTTCCACTTTCTTGATATTTGTG
    GGTGCGTTTAACGACGTCATTACTATGAGAGTCGGTATAGCCATCACATAATGACTCGAGCTTACGTCCTAC
This short script reads the whole file into a single scalar, searches it for one short sequence, and prints the 15,000+ offsets where that sequence is found, all in just over eleven seconds:
    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    my $start = time();

    local $/;
    my $DNA = <>;
    $DNA =~ tr[\n][]d;

    my $seek = 'AGAGAGAA';
    my $p = 0;
    printf "%s found at position %d\n", $seek, $p
        while $p = 1+index $DNA, $seek, $p;

    printf STDERR "Took %.3f seconds\n", time() - $start;

    __END__
    C:\test>DNAsearch1 randDNA.txt | wc -l
    Took 11.281 seconds
    15313
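For anyone puzzled by the `while $p = 1+index ...` loop in the script, here is a minimal sketch of the same idiom on a toy string. The helper name `find_all` and the test string are made up for illustration; they are not part of the script above.

```perl
use strict;
use warnings;

# find_all is a hypothetical helper: it returns the 1-based positions of
# every occurrence of $needle in $haystack, overlapping ones included.
sub find_all {
    my( $haystack, $needle ) = @_;
    my @hits;
    my $p = 0;
    # index returns -1 once there are no more matches, so 1 + index(...)
    # becomes 0 (false) and the loop ends. Otherwise $p is the 1-based
    # position of the hit, which doubles as the 0-based offset at which
    # the next search starts (one character past the start of this hit).
    push @hits, $p while $p = 1 + index( $haystack, $needle, $p );
    return @hits;
}

# Two occurrences that share the 'A' in the middle of the string.
my @hits = find_all( 'AGAGAGAAGAGAGAA', 'AGAGAGAA' );
print "@hits\n";    # prints "1 8"
```

Because each search resumes just one character past the start of the previous hit, overlapping occurrences are reported too.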
This does the same thing using your 100-bytes-at-a-time method:
    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    my $seek = 'AGAGAGAA';
    my $start = time();

    my $file = shift;
    open DNA, '<', $file or die $!;
    my $size = -s( *DNA );

    for my $o ( 1 .. $size - 100 ) {
        read( DNA, my $DNA, 100 );
        $DNA =~ tr[\n][]d;
        my $p =0;
        printf "%s found at position %d\n", $seek, $p
            while $p = 1+index $DNA, $seek, $p;
        seek( DNA, $o, 0 );
    }

    printf "Took %.3f seconds\n", time() - $start;

    __END__
    [ 1:32:07.24] C:\test>DNAsearch2 randDNA.txt | wc -l
    1441865

    [ 4:12:28.90] C:\test>
It has been running for 30+ minutes now and I don't expect it to finish anytime soon, so I'll leave it running and report back tomorrow.
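Updated above: it took over 2 1/2 hours and reported 1.4 million hits instead of 15,000. Consecutive 100-byte windows overlap by 99 bytes, so each real match is found again in many of them, and the positions printed are relative to the window rather than to the file.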
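If memory really were too tight to slurp the file, a chunked read would only need to carry over the last `length($seek) - 1` characters from one block to the next. The following is a rough sketch of that idea, not something I have timed against the 1GB file. The name `search_chunked`, its parameters, and the sample data are all hypothetical, and stripping `\r` along with `\n` is an assumption about how the file is read.

```perl
use strict;
use warnings;

# search_chunked is a hypothetical helper. It reads $fh in blocks of
# $chunk bytes and returns the 1-based positions of $seek, counted in
# the data with line endings removed.
sub search_chunked {
    my( $fh, $seek, $chunk ) = @_;
    my $keep = length( $seek ) - 1;   # tail carried over between blocks
    my $buf  = '';
    my $base = 0;                     # characters already dropped from $buf
    my @hits;
    while( read( $fh, my $block, $chunk ) ) {
        $block =~ tr[\r\n][]d;        # assumption: strip CR as well as LF
        $buf .= $block;
        my $p = 0;
        push @hits, $base + $p while $p = 1 + index( $buf, $seek, $p );
        # Keep only the last $keep characters: enough to catch a match
        # that straddles two blocks, too few to hold a whole match, so
        # nothing is reported twice.
        if( length( $buf ) > $keep ) {
            $base += length( $buf ) - $keep;
            $buf = $keep ? substr( $buf, -$keep ) : '';
        }
    }
    return @hits;
}

# Toy data read through an in-memory filehandle, 5 bytes at a time.
my $data = "ACGTAGAG\nAGAATTAGAGAGAA\n";
open my $fh, '<', \$data or die $!;
print join( ' ', search_chunked( $fh, 'AGAGAGAA', 5 ) ), "\n";   # prints "5 15"
```

With a realistic block size (a megabyte or more rather than 5 bytes) this makes one pass over the file and reports each match once, at its position in the whole sequence.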