in reply to How to store large strings?

Since your goal is "to find small substrings within large DNA sequences", it is better to look for the small substrings by dividing the entire genome into many bins of a particular size. For example, if you have a DNA sequence of 1 million characters, don't read all million characters at once. Instead, you should:

1. Bin 1: read the first 100 characters at once, i.e. characters 0-99 (indexing in Perl starts from 0).

2. Look for the small substrings in that bin.

3. Bin 2: read characters 1-100 and look for the small substrings again.

4. Bin 3: read characters 2-101, and so on until you have scanned the entire sequence.

Re^2: How to store large strings?
by BrowserUk (Patriarch) on Jun 10, 2012 at 00:31 UTC

    Sorry 'nall, but that is a ludicrous suggestion.

    On a file of around 1GB of randomly generated DNA data:

    C:\test>dir randDNA.txt
    10/06/2012 01:13 1,036,000,000 randDNA.txt

    C:\test>head randDNA.txt
    ATAGTAAGGGACCTCAGCGAGTGTCATAAATATAAGTTGTCGAGAGGTAACGATAGACGCCAATCACTTTTA
    GCATCCAGGGGGTCAGTGTCCTAGCGACGTGGAACAACGACTACGCTTCGTAGGTCTCACCGTATAGATGCC
    CGGGAGGCCTGCAAAGGAGTGAAGGGTAACGCCTGAACCCTTTGGCCTATCTACGTCGAGATTTCTACCGGA
    GCGGAGATCTCCCCCCGGATTTCGTCAAATTCTGGAAATAAGTGTAGCAACCGAACGGTATAGCCAGATAAT
    GCTCGAGCACACGCGGACGGTCTCAGAAACTAATTTTCTTAAGCTGGAACAGGCAACCAAAGATTTTAGATT
    ATCGGACGTAGCCAGAAGTGCGGATTTACAGCAACGCCTTTCTCAAAAGTTGCCGTCCCGCGGCACTAATAC
    ACCGATATGAAGGCGCTGAAACGATTATGTGTAGTGACGTGCCTTTCAGCGGCTATGGACGCTATCCCCGCA
    GTCATGAGTCCAATTTGGGGTTAGCTGAAATAACCTGCTGTCCCCTAAAATTGTCGCATTCAAGCAGGGTGG
    CGGGTACACATGCTAGCATCCGGACGCTATAAGGGCTCCCTTAGTAACATTTCCACTTTCTTGATATTTGTG
    GGTGCGTTTAACGACGTCATTACTATGAGAGTCGGTATAGCCATCACATAATGACTCGAGCTTACGTCCTAC

    This short script reads that into a single scalar and searches it for a single short sequence and prints out the 15000+ offsets where it is found in just over eleven seconds:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    my $start = time();

    local $/;
    my $DNA = <>;
    $DNA =~ tr[\n][]d;

    my $seek = 'AGAGAGAA';
    my $p = 0;

    printf "%s found at position %d\n", $seek, $p
        while $p = 1+index $DNA, $seek, $p;

    printf STDERR "Took %.3f seconds\n", time() - $start;

    __END__
    C:\test>DNAsearch1 randDNA.txt | wc -l
    Took 11.281 seconds
    15313

    This does the same thing using your 100 bytes-at-a-time method:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    my $seek = 'AGAGAGAA';

    my $start = time();

    my $file = shift;
    open DNA, '<', $file or die $!;
    my $size = -s( *DNA );

    for my $o ( 1 .. $size - 100 ) {
        read( DNA, my $DNA, 100 );
        $DNA =~ tr[\n][]d;
        my $p =0;
        printf "%s found at position %d\n", $seek, $p
            while $p = 1+index $DNA, $seek, $p;
        seek( DNA, $o, 0 );
    }

    printf "Took %.3f seconds\n", time() - $start;

    __END__
    [ 1:32:07.24] C:\test>DNAsearch2 randDNA.txt | wc -l
    1441865

    [ 4:12:28.90] C:\test>

    It has been running for 30+ minutes now and I don't expect it to finish anytime soon, so I'll leave it running and report back tomorrow.

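    Updated above: Over 2 1/2 hours and 1.4 million hits instead of 15,000. Because the window advances only one byte at a time, every match is reported again by each overlapping window that contains it, and the positions printed are relative to the window rather than to the file.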



Re^2: How to store large strings?
by CountZero (Bishop) on Jun 10, 2012 at 07:16 UTC
    Why do you shift your window by one character at a time?

    Think it over again. You can go much faster by starting each new window at the end of the previous one, less the length of the substring to match, plus one. That ensures that any small substring split across the present window and the next will lie completely within the next window.
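The overlapping-window idea above could be sketched roughly as follows. This is only an illustration, not code from the thread: `find_in_chunks` and its parameters are hypothetical names, it assumes a non-empty search string, and it reports 1-based positions counted with line breaks removed, as the scripts earlier in the thread do.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): search filehandle $fh for $seek,
# reading $chunk characters at a time. The last length($seek)-1 characters of
# each window are carried into the next one, so a match split across a chunk
# boundary is still found, and no match is ever reported twice.
sub find_in_chunks {
    my ( $fh, $seek, $chunk ) = @_;
    return unless length $seek;          # assumption: empty pattern is not searched
    my $keep = length($seek) - 1;        # overlap carried between windows
    my ( $tail, $base, @hits ) = ( '', 0 );

    while ( read( $fh, my $buf, $chunk ) ) {
        $buf =~ tr/\r\n//d;              # drop line breaks
        my $win = $tail . $buf;          # carried tail plus freshly read data
        my $p   = 0;
        while ( ( $p = index( $win, $seek, $p ) ) >= 0 ) {
            push @hits, $base + $p + 1;  # $base = characters before this window
            $p++;
        }
        # Keep only the tail that could begin a match finished by the next chunk.
        my $cut = length($win) > $keep ? length($win) - $keep : 0;
        $tail = substr( $win, $cut );
        $base += $cut;
    }
    return @hits;
}

# Usage on a small in-memory "file"; a real run would use a much larger chunk.
my $dna = "ACGTAGAGAGAAGT\nTTAGAGAGAACC\n";
open my $fh, '<', \$dna or die $!;
print join( ',', find_in_chunks( $fh, 'AGAGAGAA', 10 ) ), "\n";   # prints 5,17
```

With a chunk size of a few megabytes, each byte of the file is read once and only a handful of characters are rescanned per chunk, whereas the one-character shift rereads almost the whole window every time.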

    CountZero
