in reply to Re^5: Memory Usage in Regex On Large Sequence
in thread Memory Usage in Regex On Large Sequence

Hi Browser,

Yup, removing Bio::SeqIO doesn't resolve the crashes, but it dramatically speeds up execution, and the program runs quite a bit deeper before crashing. I also tried reducing the number of threads in the BioPerl-less program, and found that the fewer threads active at any one time, the deeper execution can go: with just 3 threads it completes, but with 5 it terminates about 80% of the way through the dataset. That seems to suggest the threads are using a lot of memory; is there a way of assessing the memory footprint of an individual thread?

Also, you previously mentioned that it might be useful to create a pool of threads up front and reuse them. I'd like to try that, but I'm unsure how, and I didn't see anything in perldoc threads, though I could have missed it. My thought is that perhaps threads are "leaking" some memory on my system, and reusing threads might help identify that.
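
From what I've pieced together so far, the pattern would look something like this minimal sketch using Thread::Queue (the pool size, the chromosomes/*.fa glob, and process_sequence() are placeholders of mine), though I'd welcome corrections:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $POOL_SIZE = 3;
    my $queue     = Thread::Queue->new;

    sub process_sequence {
        my ($file) = @_;
        # ... load the sequence and run the motif searches here ...
    }

    # Each worker blocks on the shared queue and is reused for job after
    # job; an undef item tells it to exit.
    sub worker {
        while ( defined( my $file = $queue->dequeue ) ) {
            process_sequence($file);
        }
    }

    # Create the pool once, up front, instead of one thread per chromosome.
    my @pool = map { threads->create( \&worker ) } 1 .. $POOL_SIZE;

    # Feed the work in, then send one terminator per worker.
    $queue->enqueue($_) for glob 'chromosomes/*.fa';
    $queue->enqueue(undef) for 1 .. $POOL_SIZE;

    $_->join for @pool;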

Many thanks (again) for your help

Re^7: Memory Usage in Regex On Large Sequence
by BrowserUk (Patriarch) on Sep 26, 2006 at 23:08 UTC
    is there a way of assessing the memory footprint of an individual thread?

    I know how to do that on Win32, but I've no experience of the threading tools on *nix. Maybe top shows you something?

    How big are the sequences you are searching?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I don't have privileges for top, but I just put in a request for them, so I should be able to test it out in a day or two. These are complete chromosomes, so they range in size from about 20 kbp (20,000 letters) for the mitochondrial genome to about 250 million bp for chromosome 1.

        If you're running into memory issues after loading an entire chromosome, could you simply split the sequences into smaller fragments (e.g., <= 200,000 bp each) before searching them for motifs? This is a common technique in biological sequence analysis, and it may help you spread the load more evenly across threads. If you take this approach, make sure you overlap adjacent fragments by at least the length of your longest motif minus one, so you don't miss any motifs at the junctions. Finally, if you're doing this repeatedly, you can save the fragments as separate files so you only have to split them once (the offset of each fragment can be included in the filename for easy reference); a sketch follows.
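
        For instance, a minimal sketch of the splitting step (the fragment size, overlap, and file-naming scheme here are only assumptions):

          use strict;
          use warnings;

          my $FRAG_SIZE = 200_000;    # assumed fragment size
          my $OVERLAP   = 1_000;      # must cover your longest motif minus one

          # Write each fragment to its own file, with its chromosome
          # offset in the name so matches can be mapped back.
          sub save_fragments {
              my ( $name, $seq ) = @_;
              my $step = $FRAG_SIZE - $OVERLAP;
              for ( my $offset = 0; $offset < length $seq; $offset += $step ) {
                  open my $out, '>', "$name.$offset.frag" or die $!;
                  print $out substr( $seq, $offset, $FRAG_SIZE );
                  close $out;
              }
          }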

        HTH

        Hm. I should've asked that question earlier. :) It's been so long since I used a system without admin privileges that I forgot some places (needlessly) restrict 'users' from even the most elementary debugging and system-information tools. I just assumed that you would already have verified whether you were simply asking for more memory than your OS will let a process have.

        If you are trying to run 10 concurrent threads, each having loaded 1/4 GB of data, you're quite likely to be blowing the process memory limit. On a 32-bit Intel processor that limit is likely to be 2 GB.
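
        (10 threads × ~250 MB is ~2.5 GB of sequence data alone, past a 2 GB ceiling before you even count Perl's own per-thread overhead.)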

        Even if you are on hardware that allows a much larger process memory, it is possible that there are admin-imposed memory limits coming into play. That is something you would need to ask your admin about.

        I realise that many of the chromosomes are probably well under that 1/4 GB size, and the problem will only arise when a confluence of 10 big ones comes together; but given that the bigger chromosomes take longer to process, that is almost inevitable. I.e. even if the sizes are randomly distributed, the small ones will be processed quickly, so you will nearly always end up trying to process 10 big ones at the same time.

        One way around this would be to alter your thread-management strategy. Instead of limiting by the number of threads running, limit by the combined size of the chromosomes you are processing (see the sketch after the list):

        1. As you load each chromosome, add its size to a running total.

        2. If the running total of data loaded + 10 MB * the number of running threads + the size of the next chromosome (the filesize should be a good enough approximation for this purpose) is less than the memory limit for your process, start another thread on the next chromosome.

        3. Otherwise, wait until a thread terminates and subtract its memory usage from the running total.
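
        A rough sketch of that throttle (the memory limit, the 10 MB per-thread allowance, the input glob, and search_chromosome() are all assumptions; it reaps finished workers with threads->list( threads::joinable ), which needs a reasonably recent threads module):

          use strict;
          use warnings;
          use threads;

          my $MEM_LIMIT = 1.5 * 2**30;    # stay well below a 2 GB process limit
          my $OVERHEAD  = 10 * 2**20;     # ~10 MB bookkeeping per running thread

          sub search_chromosome {
              my ($file) = @_;
              # ... load the sequence and run the motif searches here ...
          }

          my %charged;    # thread id => bytes charged to the budget
          my $loaded = 0; # running total of data loaded (step 1)

          for my $file ( glob 'chromosomes/*.fa' ) {
              my $next = -s $file;    # filesize approximates loaded size

              # Steps 2 and 3: block until the next chromosome fits the budget.
              while ( $loaded + $OVERHEAD * threads->list + $next > $MEM_LIMIT ) {
                  last unless threads->list;    # nothing to reap; it must fit alone
                  my @done = threads->list( threads::joinable );
                  if ( !@done ) { sleep 1; next }    # nothing finished yet; poll
                  for my $thr (@done) {
                      $loaded -= delete $charged{ $thr->tid };    # credit it back
                      $thr->join;
                  }
              }

              my $thr = threads->create( \&search_chromosome, $file );
              $charged{ $thr->tid } = $next;
              $loaded += $next;
          }
          $_->join for threads->list;    # wait for the stragglers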

        There are various ways you could adjust that algorithm to balance the number of threads against the memory consumed, but as it stands it should prevent the current problem, assuming that process memory is being exceeded, as now seems likely.

