Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi fello Monks!
I am having difficulty with a project that involves large DNA sequences.
My goal is to find small substrings within large DNA sequences, but the problem is that I can't assign the whole sequence to a variable.
I tried the script in 2 different computers and realised that it is a memory problem, because the "slower" one allowed for ~62000 characters to be stored in the string, while the second one, for 67000. But the sequences I have can be much much longer in length...
Is there something else I could try?

Replies are listed 'Best First'.
Re: How to store large strings?
by BrowserUk (Patriarch) on Jun 09, 2012 at 20:05 UTC
    I can't assign the whole sequence to a variable.

    That doesn't make any sense, This shows perl storing 6 billion characters in a single variable:

    [0] Perl> $a = 'x'x 1e6;; [0] Perl> $a x= 6000;; [0] Perl> print length $a;; 6000000000

    So there is more to your problem than you are describing. The best thing you can do, is post the code that is giving you the problem.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: How to store large strings?
by ww (Archbishop) on Jun 09, 2012 at 21:45 UTC
    Lemme' guess...

    Computer 1 is a TRS-80; computer 2 may have a Z80 CPU and run CP/M....

    Congratulations on porting Perl to either... :-)

    Seriously, a quick check of questions posted here suggests that others working with "large DNA sequences" are working with strings f a a a r longer than those you cite. Unless you're working on a shared system with incredibly; no, make that "unbelieveably" tight user-space limits, there's something missing in the picture you've presented so far.

      Computer 1 is a TRS-80; computer 2 may have a Z80 CPU and run CP/M
      Fond memories! A TRS-80 was my first computer, but unless my memory has failed me utterly, it already had a Z80 CPU (and with the "expansion interface" a whopping 48K of RAM).

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
Re: How to store large strings?
by thomas895 (Deacon) on Jun 09, 2012 at 21:39 UTC

    Nonsense! Your computer would have to have some outrageously low memory installed for that to be the case. Unless, of course you are using some ancient machine with 640K of RAM. But even then, I think it should still fit.

    You could try an array as well, just split the string every 5000 characters or so. Although, I don't think that will help much, if at all.
    We'll need to see your code in order to give any further advice.

    ~Thomas~
    confess( "I offer no guarantees on my code." );
Re: How to store large strings?
by snape (Pilgrim) on Jun 09, 2012 at 23:41 UTC

    Since your goal is: "is to find small substrings within large DNA sequences". It is better than you look for the small substrings by dividing the entire genome into many bins of particular size, for example: if you have DNA sequence of 1 million characters then don't read the entire million character at once. You should

    1. Bin1: read first 100 characters i.e. (character 0-99 taking into account that indexing in perl starts from 0) at once

    2. look for the small substrings and

    3. Bin 2: then again read the 1-100 characters and look for small substrings

    4. Bin3: read 2-101 characters and so on until you have scanned the entire sequence.

      Sorry 'nall, but that is a ludicrous suggestion.

      On a file of around 1GB of randomly generated DNA data:

      C:\test>dir randDNA.txt 10/06/2012 01:13 1,036,000,000 randDNA.txt C:\test>head randDNA.txt ATAGTAAGGGACCTCAGCGAGTGTCATAAATATAAGTTGTCGAGAGGTAACGATAGACGCCAATCACTTT +TA GCATCCAGGGGGTCAGTGTCCTAGCGACGTGGAACAACGACTACGCTTCGTAGGTCTCACCGTATAGATG +CC CGGGAGGCCTGCAAAGGAGTGAAGGGTAACGCCTGAACCCTTTGGCCTATCTACGTCGAGATTTCTACCG +GA GCGGAGATCTCCCCCCGGATTTCGTCAAATTCTGGAAATAAGTGTAGCAACCGAACGGTATAGCCAGATA +AT GCTCGAGCACACGCGGACGGTCTCAGAAACTAATTTTCTTAAGCTGGAACAGGCAACCAAAGATTTTAGA +TT ATCGGACGTAGCCAGAAGTGCGGATTTACAGCAACGCCTTTCTCAAAAGTTGCCGTCCCGCGGCACTAAT +AC ACCGATATGAAGGCGCTGAAACGATTATGTGTAGTGACGTGCCTTTCAGCGGCTATGGACGCTATCCCCG +CA GTCATGAGTCCAATTTGGGGTTAGCTGAAATAACCTGCTGTCCCCTAAAATTGTCGCATTCAAGCAGGGT +GG CGGGTACACATGCTAGCATCCGGACGCTATAAGGGCTCCCTTAGTAACATTTCCACTTTCTTGATATTTG +TG GGTGCGTTTAACGACGTCATTACTATGAGAGTCGGTATAGCCATCACATAATGACTCGAGCTTACGTCCT +AC

      This short script reads that into a single scalar and searches it for a single short sequence and prints out the 15000+ offsets where it is found in just over eleven seconds:

      #! perl -slw use strict; use Time::HiRes qw[ time ]; my $start = time(); local $/; my $DNA = <>; $DNA =~ tr[\n][]d; my $seek = 'AGAGAGAA'; my $p = 0; printf "%s found at position %d\n", $seek, $p while $p = 1+index $DNA, $seek, $p; printf STDERR "Took %.3f seconds\n", time() - $start; __END__ C:\test>DNAsearch1 randDNA.txt | wc -l Took 11.281 seconds 15313

      This does the same thing using your 100 bytes-at-a-time method:

      #! perl -slw use strict; use Time::HiRes qw[ time ]; my $seek = 'AGAGAGAA'; my $start = time(); my $file = shift; open DNA, '<', $file or die $!; my $size = -s( *DNA ); for my $o ( 1 .. $size - 100 ) { read( DNA, my $DNA, 100 ); $DNA =~ tr[\n][]d; my $p =0; printf "%s found at position %d\n", $seek, $p while $p = 1+index $DNA, $seek, $p; seek( DNA, $o, 0 ); } printf "Took %.3f seconds\n", time() - $start; __END__ [ 1:32:07.24] C:\test>DNAsearch2 randDNA.txt | wc -l 1441865 [ 4:12:28.90] C:\test>

      It has been running for 30+ minutes now and I don't expect it to finish anytime soon, so I'll leave it running and report back tomorrow.

      Updated above: Over 2 1/2 hours and 1.4 million hits instead of 15,000.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      The start of some sanity?

      Why do you shift your window by one character at a time?

      Think it over again. You can do much faster by moving your window to the end of the previous section less the length of the substring to match plus one. That will make sure that any small substring split over the present window and the next, will now be completely in the next window.

      CountZero

      A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
Re: How to store large strings?
by BlueSkyDreamer (Initiate) on Jun 11, 2012 at 21:29 UTC
    You probably want to create standalone BLAST database ant run search against it. http://www.bioperl.org/wiki/BLAST