How to store large strings?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How to store large strings? by BrowserUk (Patriarch) on Jun 09, 2012 at 20:05 UTC
I can't assign the whole sequence to a variable. That doesn't make any sense, This shows perl storing 6 billion characters in a single variable: `[0] Perl> $a = 'x'x 1e6;; [0] Perl> $a x= 6000;; [0] Perl> print length $a;; 6000000000` [download] So there is more to your problem than you are describing. The best thing you can do, is post the code that is giving you the problem. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply] [d/l]
Re: How to store large strings? by ww (Archbishop) on Jun 09, 2012 at 21:45 UTC
Lemme' guess... Computer 1 is a TRS-80; computer 2 may have a Z80 CPU and run CP/M.... Congratulations on porting Perl to either... :-) Seriously, a quick check of questions posted here suggests that others working with "large DNA sequences" are working with strings f a a a r longer than those you cite. Unless you're working on a shared system with ~~incredibly~~; no, make that "unbelieveably" tight user-space limits, there's something missing in the picture you've presented so far.	[reply]
Re^2: How to store large strings? by CountZero (Bishop) on Jun 10, 2012 at 07:20 UTC
Computer 1 is a TRS-80; computer 2 may have a Z80 CPU and run CP/M Fond memories! A TRS-80 was my first computer, but unless my memory has failed me utterly, it already had a Z80 CPU (and with the "expansion interface" a whopping 48K of RAM). CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics	[reply]
Re: How to store large strings? by thomas895 (Deacon) on Jun 09, 2012 at 21:39 UTC
Nonsense! Your computer would have to have some outrageously low memory installed for that to be the case. Unless, of course you are using some ancient machine with 640K of RAM. But even then, I think it should still fit. You could try an array as well, just split the string every 5000 characters or so. Although, I don't think that will help much, if at all. We'll need to see your code in order to give any further advice. ~Thomas~ `confess( "I offer no guarantees on my code." );`	[reply] [d/l]
Re: How to store large strings? by snape (Pilgrim) on Jun 09, 2012 at 23:41 UTC
Since your goal is: "is to find small substrings within large DNA sequences". It is better than you look for the small substrings by dividing the entire genome into many bins of particular size, for example: if you have DNA sequence of 1 million characters then don't read the entire million character at once. You should 1. Bin1: read first 100 characters i.e. (character 0-99 taking into account that indexing in perl starts from 0) at once 2. look for the small substrings and 3. Bin 2: then again read the 1-100 characters and look for small substrings 4. Bin3: read 2-101 characters and so on until you have scanned the entire sequence.	[reply]
Re^2: How to store large strings? by BrowserUk (Patriarch) on Jun 10, 2012 at 00:31 UTC
Sorry 'nall, but that is a ludicrous suggestion. On a file of around 1GB of randomly generated DNA data: C:\test>dir randDNA.txt 10/06/2012 01:13 1,036,000,000 randDNA.txt C:\test>head randDNA.txt ATAGTAAGGGACCTCAGCGAGTGTCATAAATATAAGTTGTCGAGAGGTAACGATAGACGCCAATCACTTT +TA GCATCCAGGGGGTCAGTGTCCTAGCGACGTGGAACAACGACTACGCTTCGTAGGTCTCACCGTATAGATG +CC CGGGAGGCCTGCAAAGGAGTGAAGGGTAACGCCTGAACCCTTTGGCCTATCTACGTCGAGATTTCTACCG +GA GCGGAGATCTCCCCCCGGATTTCGTCAAATTCTGGAAATAAGTGTAGCAACCGAACGGTATAGCCAGATA +AT GCTCGAGCACACGCGGACGGTCTCAGAAACTAATTTTCTTAAGCTGGAACAGGCAACCAAAGATTTTAGA +TT ATCGGACGTAGCCAGAAGTGCGGATTTACAGCAACGCCTTTCTCAAAAGTTGCCGTCCCGCGGCACTAAT +AC ACCGATATGAAGGCGCTGAAACGATTATGTGTAGTGACGTGCCTTTCAGCGGCTATGGACGCTATCCCCG +CA GTCATGAGTCCAATTTGGGGTTAGCTGAAATAACCTGCTGTCCCCTAAAATTGTCGCATTCAAGCAGGGT +GG CGGGTACACATGCTAGCATCCGGACGCTATAAGGGCTCCCTTAGTAACATTTCCACTTTCTTGATATTTG +TG GGTGCGTTTAACGACGTCATTACTATGAGAGTCGGTATAGCCATCACATAATGACTCGAGCTTACGTCCT +AC [download] This short script reads that into a single scalar and searches it for a single short sequence and prints out the 15000+ offsets where it is found in just over eleven seconds: `#! perl -slw use strict; use Time::HiRes qw[ time ]; my $start = time(); local $/; my $DNA = <>; $DNA =~ tr[\n][]d; my $seek = 'AGAGAGAA'; my $p = 0; printf "%s found at position %d\n", $seek, $p while $p = 1+index $DNA, $seek, $p; printf STDERR "Took %.3f seconds\n", time() - $start; __END__ C:\test>DNAsearch1 randDNA.txt \| wc -l Took 11.281 seconds 15313` [download] This does the same thing using your 100 bytes-at-a-time method: `#! perl -slw use strict; use Time::HiRes qw[ time ]; my $seek = 'AGAGAGAA'; my $start = time(); my $file = shift; open DNA, '<', $file or die $!; my $size = -s( DNA ); for my $o ( 1 .. $size - 100 ) { read( DNA, my $DNA, 100 ); $DNA =~ tr[\n][]d; my $p =0; printf "%s found at position %d\n", $seek, $p while $p = 1+index $DNA, $seek, $p; seek( DNA, $o, 0 ); } printf "Took %.3f seconds\n", time() - $start; __END__ [ 1:32:07.24] C:\test>DNAsearch2 randDNA.txt \| wc -l 1441865 [ 4:12:28.90] C:\test>` [download] It has been running for 30+ minutes* now and I don't expect it to finish anytime soon, so I'll leave it running and report back tomorrow. Updated above: Over 2 1/2 hours and 1.4 million hits instead of 15,000. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply] [d/l] [select]
Re^2: How to store large strings? by CountZero (Bishop) on Jun 10, 2012 at 07:16 UTC
Why do you shift your window by one character at a time? Think it over again. You can do much faster by moving your window to the end of the previous section less the length of the substring to match plus one. That will make sure that any small substring split over the present window and the next, will now be completely in the next window. CountZero A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James My blog: Imperial Deltronics	[reply]
Re: How to store large strings? by BlueSkyDreamer (Initiate) on Jun 11, 2012 at 21:29 UTC
You probably want to create standalone BLAST database ant run search against it. http://www.bioperl.org/wiki/BLAST	[reply]