Sorry 'nall, but that is a ludicrous suggestion.
On a file of around 1GB of randomly generated DNA data:
C:\test>dir randDNA.txt
10/06/2012 01:13 1,036,000,000 randDNA.txt
C:\test>head randDNA.txt
ATAGTAAGGGACCTCAGCGAGTGTCATAAATATAAGTTGTCGAGAGGTAACGATAGACGCCAATCACTTTTA
GCATCCAGGGGGTCAGTGTCCTAGCGACGTGGAACAACGACTACGCTTCGTAGGTCTCACCGTATAGATGCC
CGGGAGGCCTGCAAAGGAGTGAAGGGTAACGCCTGAACCCTTTGGCCTATCTACGTCGAGATTTCTACCGGA
GCGGAGATCTCCCCCCGGATTTCGTCAAATTCTGGAAATAAGTGTAGCAACCGAACGGTATAGCCAGATAAT
GCTCGAGCACACGCGGACGGTCTCAGAAACTAATTTTCTTAAGCTGGAACAGGCAACCAAAGATTTTAGATT
ATCGGACGTAGCCAGAAGTGCGGATTTACAGCAACGCCTTTCTCAAAAGTTGCCGTCCCGCGGCACTAATAC
ACCGATATGAAGGCGCTGAAACGATTATGTGTAGTGACGTGCCTTTCAGCGGCTATGGACGCTATCCCCGCA
GTCATGAGTCCAATTTGGGGTTAGCTGAAATAACCTGCTGTCCCCTAAAATTGTCGCATTCAAGCAGGGTGG
CGGGTACACATGCTAGCATCCGGACGCTATAAGGGCTCCCTTAGTAACATTTCCACTTTCTTGATATTTGTG
GGTGCGTTTAACGACGTCATTACTATGAGAGTCGGTATAGCCATCACATAATGACTCGAGCTTACGTCCTAC
This short script reads that file into a single scalar, searches it for one short sequence, and prints the 15,000+ offsets where it is found, all in just over eleven seconds:
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
my $start = time();
local $/;
my $DNA = <>;
$DNA =~ tr[\n][]d;
my $seek = 'AGAGAGAA';
my $p = 0;
printf "%s found at position %d\n",
$seek, $p while $p = 1+index $DNA, $seek, $p;
printf STDERR "Took %.3f seconds\n", time() - $start;
__END__
C:\test>DNAsearch1 randDNA.txt | wc -l
Took 11.281 seconds
15313
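The loop in that script leans on one property of `index`: it returns -1 when nothing is found, so `1+index` becomes 0 (false) and ends the loop, while any hit yields a 1-based position that doubles as the starting offset for the next search. A minimal sketch of the same idiom on a small in-memory string follows; `find_all` is a hypothetical helper name used only for illustration, not something from the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return the 1-based
# positions of every occurrence of $seek in $text, overlapping ones included.
sub find_all {
    my( $text, $seek ) = @_;
    my @hits;
    my $p = 0;
    # index returns -1 on failure, so 1+index is 0 (false) and the loop ends.
    # On success $p is the 1-based hit position, which is also the 0-based
    # offset one character past the start of that hit for the next search.
    push @hits, $p while $p = 1 + index $text, $seek, $p;
    return @hits;
}

# Two occurrences that overlap by one character.
print join( ' ', find_all( 'AGAGAGAAGAGAGAA', 'AGAGAGAA' ) ), "\n";   # prints "1 8"
```

Because the next search starts only one character past the previous hit, overlapping occurrences are reported too, and each one is reported exactly once.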
This does the same thing using your 100 bytes-at-a-time method:
#! perl -slw
use strict;
use Time::HiRes qw[ time ];
my $seek = 'AGAGAGAA';
my $start = time();
my $file = shift;
open DNA, '<', $file or die $!;
my $size = -s( *DNA );
for my $o ( 1 .. $size - 100 ) {
read( DNA, my $DNA, 100 );
$DNA =~ tr[\n][]d;
my $p =0;
printf "%s found at position %d\n",
$seek, $p while $p = 1+index $DNA, $seek, $p;
seek( DNA, $o, 0 );
}
printf "Took %.3f seconds\n", time() - $start;
__END__
[ 1:32:07.24] C:\test>DNAsearch2 randDNA.txt | wc -l
1441865
[ 4:12:28.90] C:\test>
It has been running for 30+ minutes now and I don't expect it to finish anytime soon, so I'll leave it running and report back tomorrow.
Update (results added above): the run took over 2 1/2 hours and reported 1.4 million hits instead of 15,000.
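The inflated count is what overlapping windows would be expected to produce: an 8-character hit fits entirely inside up to 100 - 8 + 1 = 93 of the 100-character windows that start one position apart, so every real hit gets reported again and again, each time with a position relative to the window rather than to the file. That is roughly consistent with the observed ratio of about 94 to 1 (1,441,865 / 15,313). The sketch below reproduces the effect on toy data; `count_hits`, `count_windowed`, and the padded test string are assumptions made for illustration, not part of either script.

```perl
use strict;
use warnings;

# Count occurrences of $seek in $text using the same index loop as above.
sub count_hits {
    my( $text, $seek ) = @_;
    my( $n, $p ) = ( 0, 0 );
    ++$n while $p = 1 + index $text, $seek, $p;
    return $n;
}

# Hypothetical stand-in for the windowed script: search a $win-character
# window starting at every offset, as the read/seek loop does on the file.
sub count_windowed {
    my( $text, $seek, $win ) = @_;
    my $n = 0;
    $n += count_hits( substr( $text, $_, $win ), $seek )
        for 0 .. length( $text ) - $win;
    return $n;
}

# Toy data (an assumption, not the 1GB file): a single hit with 200 Cs
# of padding on each side.
my $seek = 'AGAGAGAA';
my $text = ( 'C' x 200 ) . $seek . ( 'C' x 200 );

printf "whole string:     %d hit(s)\n", count_hits( $text, $seek );            # 1
printf "100-char windows: %d hit(s)\n", count_windowed( $text, $seek, 100 );   # 93
```

One real hit turns into 93 reported hits, while the number of searches grows from one to one per character of input.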
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".