I have exactly the same suggestion as hdb. To put it more formally: build the statistical distribution of your data and then sample from it. If you need a more complex model, one that depends not only on the frequencies of single letters but also on the frequencies of pairs, triplets and so on, then build a multi-dimensional distribution or use a Markov chain. The latter is the standard tool for random text generation, which more or less does what you want: it creates new words from a corpus, respecting both the frequencies of individual letters and their neighbouring properties. So it goes one step further than your method, because it also accounts for what letters tend to follow what. For example, with your method "TTTTTTT" is as probable as "ATCGACACGT" among the shuffled sub-sequences, but in the real world the first is a freak sequence and the second a perfectly normal-looking one...
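Here is a minimal sketch of that idea, not your code: it builds a first-order Markov chain (transition counts over letter pairs) from a DNA string and samples a new sequence from it. The names $genome and $outlen are placeholders standing in for your own data.

<code>
#!/usr/bin/perl
use strict;
use warnings;

my $genome = "ATCGACACGTTTGCA" x 100;   # placeholder input, use your own
my $outlen = 50;                        # length of sequence to generate

# Count transitions: how often each letter follows each other letter.
my %trans;
for my $i (0 .. length($genome) - 2) {
    $trans{ substr($genome, $i, 1) }{ substr($genome, $i + 1, 1) }++;
}

# Weighted random pick of the next letter, given the current one.
sub next_letter {
    my ($cur)  = @_;
    my $counts = $trans{$cur};
    my $total  = 0;
    $total += $_ for values %$counts;
    my $r = rand($total);
    for my $letter (keys %$counts) {
        return $letter if ($r -= $counts->{$letter}) < 0;
    }
    return (keys %$counts)[0];   # guard against floating-point edge cases
}

# Start from a random (non-final) position in the genome, walk the chain.
my $seq = substr($genome, int rand(length($genome) - 1), 1);
$seq .= next_letter(substr($seq, -1)) for 2 .. $outlen;
print "$seq\n";
</code>

A zeroth-order version (sampling single letters by frequency alone) is the same code with the inner hash dropped; higher orders just key %trans on longer substrings.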
Comment on your code: I think reversing the array adds nothing to the shuffle (assuming shuffle() is good, which it is). Secondly, you can eliminate a lot of the split()/join() calls (after the shuffle) and just use substr() to break the whole genome into sub-sequences; see the sketch below. Thirdly, refactoring your code into subs, as hippo suggested, would be good, because then you can assess the efficiency of each step (RAM/CPU-wise as well as statistically) and you can try different methods by writing a new sub and plugging it in.
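For example, a minimal sketch of the substr() suggestion, wrapped in a sub so it can be swapped for another method later. $shuffled and $k are hypothetical names standing in for your own variables:

<code>
use strict;
use warnings;

# Break a string into pieces of length $k without split()/join().
sub break_into_subsequences {
    my ($seq, $k) = @_;
    my @pieces;
    for (my $pos = 0; $pos < length($seq); $pos += $k) {
        push @pieces, substr($seq, $pos, $k);   # last piece may be shorter
    }
    return \@pieces;
}

my $shuffled = "ATCGACACGTTTGCAATCG";   # placeholder for your shuffled genome
my $pieces   = break_into_subsequences($shuffled, 5);
print "$_\n" for @$pieces;
</code>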
Update: Eily also makes the same point about the statistics.
bw, bliako