I don't have privileges for top, but I just put in a request for them, so I should be able to test it out in a day or two. These are complete chromosomes, so they range in size from about 20 kbp (20,000 letters) for the mitochondria to about 250 million bp for chromosome 1.
If you're running into memory issues after trying to load the entire chromosome, could you simply split the sequences up into smaller fragments (e.g., <= 200,000 bp each) before searching them for motifs? This is a common technique in biological sequence analysis, and it may help you spread the load a bit more evenly across threads. If you take this approach, make sure you overlap the fragments so you don't miss any motifs at the junction. Finally, if you're doing this repeatedly, you can save the fragments as separate files so you only have to process them once (the offset of each fragment can be included in the filename for easy reference).
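A minimal sketch of that splitting idea, in Python for illustration (the function name `split_with_overlap`, the fragment length, and the overlap value are all placeholders; in practice the overlap must be at least one less than your motif length so nothing spanning a junction is missed):

```python
def split_with_overlap(seq, frag_len=200_000, overlap=50):
    """Yield (offset, fragment) pairs covering seq with overlapping windows.

    The overlap should be >= motif_length - 1 so that any motif straddling
    a fragment boundary still appears whole in at least one fragment.
    """
    step = frag_len - overlap
    for offset in range(0, len(seq), step):
        yield offset, seq[offset:offset + frag_len]
        if offset + frag_len >= len(seq):
            break

# Toy example: a 40-letter "chromosome" split into 10-letter
# fragments with a 3-letter overlap.
chrom = "ACGT" * 10
frags = list(split_with_overlap(chrom, frag_len=10, overlap=3))
```

Keeping the offset alongside each fragment (or in the filename, as suggested above) lets you map any hit back to its position on the full chromosome.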
HTH
Hm. Should've asked that question earlier :) It's been so long since I used a system without admin privileges that I forgot that some places (needlessly) restrict 'users' from even the most elementary debugging and system-information tools. I just assumed that you would have already verified whether you were simply asking for more memory than your OS could give a process.
If you are trying to run 10 concurrent threads, each having loaded 1/4 GB of data, you're quite likely to be blowing the process memory limit: 10 × 250 MB = 2.5 GB, which already exceeds the roughly 2 GB a process gets on a 32-bit Intel processor.
Even if you are on hardware that allows much larger process memory, it is possible that there are admin imposed memory limits coming into play. That is something you would need to ask your admin about.
I realise that many of the chromosomes are probably well under that 1/4 GB size, and the problem will only arise when 10 big ones come together; but given that the bigger chromosomes take longer to process, that is almost inevitable. I.e. even if the sizes are randomly distributed, the small ones will be processed quickly, so you will nearly always end up trying to process 10 big ones at the same time.
One way around this would be to alter your thread management strategy accordingly. Instead of limiting by the number of threads running, limit by the combined size of the chromosomes you are processing:
- As you load each chromosome, add it to a running total.
- If
  the running total of data loaded
  + 10 MB * the number of running threads
  + the size of the next chromosome (the filesize should be a good enough approximation for this purpose)
  is less than the memory limit for your process, start another thread on the next chromosome.
- Otherwise wait until a thread terminates and subtract its memory usage from the running total.
There are various ways you could adjust that algorithm to try and balance the number of threads against the memory consumed, but as is it should prevent the current problem, assuming that process memory is indeed being exceeded, as now seems likely.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.