in reply to Reduce RAM required

Aren't the four letters in your sequences basically equiprobable (i.e. do 'a', 't', 'g' and 'c' each appear roughly as many times as the others)? If so, do you need the output file to have exactly the same number of occurrences of each letter as the input? I'm not a biologist, but the arbitrary - yet quite large - window size and the shuffling make me doubt the data is supposed to be meaningful in any way. Besides:

# throw in some reverse sequence alternatively to shuffle it more randomly
is useless at best. Trying to increase the randomness of some data without external input is either going to have no effect on the probabilities or, more likely, make the output less random.
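To illustrate the "no effect" case, here is a small sketch in Python (not the language of the original program). It assumes the shuffle itself is uniform, and the function name and the tiny test sequence are made up for the example. It enumerates every ordering of a short sequence and checks that reversing each one gives back exactly the same collection of orderings, so reversing after a uniform shuffle cannot change the probabilities:

```python
from collections import Counter
from itertools import permutations


def reversal_preserves_distribution(seq):
    """Check that reversing every ordering of seq yields the same multiset of orderings."""
    # Every possible ordering of the letters, counted once per permutation.
    shuffles = Counter("".join(p) for p in permutations(seq))
    # The same orderings, each one reversed.
    reversed_shuffles = Counter(s[::-1] for s in shuffles.elements())
    return shuffles == reversed_shuffles


if __name__ == "__main__":
    # Hypothetical tiny sequence; real sequences are far too long to enumerate.
    print(reversal_preserves_distribution("ACGT"))  # prints True
```

This only covers the ideal case. If the shuffle is not uniform, the reversal still adds no randomness of its own, since it is a fixed, deterministic step.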

If you don't care about exactly matching the number of occurrences of each letter, your program just becomes "replace each sequence by a random sequence of the same length", which can be coded in a few lines. If you do care, hdb's answer might be the way to go.
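A minimal sketch of both options, written in Python rather than the language of the original program. The function names, the "ACGT" alphabet and the sample sequence are assumptions for illustration, and the second function is a plain whole-sequence shuffle, not hdb's approach:

```python
import random


def random_sequence(length, alphabet="ACGT", rng=random):
    """Return a sequence of the given length, each letter drawn uniformly from alphabet."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def shuffled_sequence(seq, rng=random):
    """Return a random permutation of seq, keeping the exact count of each letter."""
    letters = list(seq)
    rng.shuffle(letters)  # in-place Fisher-Yates shuffle
    return "".join(letters)


if __name__ == "__main__":
    original = "ATGCGATTACAGGCTA"  # made-up sample sequence
    print(random_sequence(len(original)))  # same length, counts not preserved
    print(shuffled_sequence(original))     # same length, counts preserved
```

The first function needs only the length of each input sequence, so memory use stays small if sequences are processed one at a time. The second has to hold one whole sequence in memory, and it deliberately ignores the window size used in the original program.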

Re^2: Reduce RAM required
by onlyIDleft (Scribe) on Jan 09, 2019 at 16:27 UTC

    1. ATGC frequencies are never equal except by fluke

    2. DNA sequences have periodicities or features at different size ranges

    3. Sequences read in forward and reverse orientations almost always yield different biological meanings

    But working out how alternate reversal, a different shuffle window size, etc. would change the signal-to-noise ratio would be a multi-week study by itself

    So your advice on keeping it simple and easy in terms of the coding is well taken :) Thank you!