Let's see if I understand your question, by summarising the descriptions you've given.
You start with a real DNA sequence of ~100MB, which you pass to a 3rd-party program. That program searches it for a particular sequence (or sequences?) between 20 bytes and 20 kbytes in length, identified by the presence of two sequences (of ~50 bytes each), one at either end of the wanted sequence.
E.g.
...xxxxxxxxHEADERxxxxx 20-20k bytes xxxxxTRAILERxxxxxxxxxx....
From your graph, I suspect that you run this process on several (3 shown) real DNA sequences?
Further, I suspect that the (unstated) aim of this process is to identify the ~50 byte header and trailer sequences that are common to all the different DNA sequences, and that delimit a 'common subsequence' of DNA across species?
(Do you supply the header & trailer sequences to the 3rd party program?)
The purpose of your windowed shuffling process is to mix up the real DNA in a way that is randomised but locally, statistically similar, in order to eliminate false positives: if the header and trailer sequences you've previously identified are still found in the randomised sequences, then they are probably not good candidates for identifying common sequences.
And your question is whether the way you are randomising the sequences, via this windowing mechanism, is statistically valid.
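To check that I have the windowing mechanism right, here is a minimal sketch of what I take "windowed shuffling" to mean. It is in Python rather than whatever you are actually using, and the function name, the fixed non-overlapping windows, and the uniform shuffle within each window are all my assumptions, not something taken from your description.

```python
import random

def windowed_shuffle(seq, window, rng=None):
    """Shuffle `seq` within consecutive, non-overlapping windows.

    Assumption: windows are fixed-size and do not overlap, and the
    bases inside each window are permuted uniformly at random. Each
    window keeps its own base composition; only the order changes.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if rng is None:
        rng = random.Random()
    pieces = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])  # last chunk may be shorter
        rng.shuffle(chunk)                       # in-place permutation
        pieces.append("".join(chunk))
    return "".join(pieces)

# Hypothetical usage with a toy sequence and a seeded generator
dna = "ACGTTGCAACGGTTAACCGT" * 3
shuffled = windowed_shuffle(dna, window=20, rng=random.Random(42))
```

If that matches what you are doing, then the local base composition is preserved exactly per window, and the window size decides how "local" that preservation is.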
In reply to Re: Window size for shuffling DNA?
by BrowserUk
in thread Window size for shuffling DNA?
by onlyIDleft