
Please allow me to confirm your understanding of my problem:

1. The DNA sequence that I feed the software varies in size but is large, often ~250MB or more. You understood that correctly: it is larger than the sliding window size by at least 2 orders of magnitude.

2. Yes, I've already run this software on DNA from several different species. I now want to estimate the FDR for each of these species.

3. There is no (unstated) aim of this analysis other than reporting the # of elements for each species (that analysis is done) AND the FDR for each species (this is where I am having issues and seek help here). I am NOT trying to identify header and trailer terminal sequences common to all species. That is a good segue into the next point...

4. Indeed, I DO supply the software with 2 separate libraries of LCVs, one for the header sequences and another for the trailer sequences. These are supposed to be 'bona fide' based on independent verification, either experimental or by some other bioinformatic approach. LCVs are supposed to be similar to profile HMMs, but that is all I know about LCVs at this point.

5. This is an important point: you ask if I want to eliminate false positives. This might be opening a can of worms, BUT the short answer is NO. What I am REALLY trying to do is count and compare the # of hits with regular vs. randomized DNA inputs, simply to assess and report the FDR. Because of the shuffling, IMO it would be quite complex to "identify" preserved elements and "lost" elements. Rather than "identify" true elements, I just want to report how many of them are likely false positives.
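To make point 5 concrete, here is one simple way the comparison could be written down. It is only a sketch: it assumes that the number of hits on shuffled input approximates the number of false positives among the hits on the real input. The symbols N_real and N_shuffled are my own labels, not anything defined by the software.

```latex
\[
\widehat{\mathrm{FDR}} \approx \frac{N_{\text{shuffled}}}{N_{\text{real}}}
\]
```

Here N_real is the # of elements predicted on the original DNA and N_shuffled is the # predicted on the shuffled version of the same DNA.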

Problem 1 : Because the shuffling is random, I expect that every time I shuffle and THEN predict the # of elements, I will obtain a different result. Ideally, the workaround would be to shuffle a large number of times and so obtain a more reliable FDR. But the run time of the software makes this not viable. So I am concerned about the statistical validity of a single random shuffle. I don't think this can be circumvented by shuffling the same input DNA sequence 20 times in a row and then providing the result as input. Though such an iterative shuffle would no doubt randomize the input sequence much better IMO, it would still produce ONLY 1 FDR value. So that would still not be reliable. Right?
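To show what I mean by "more reliable", here is a minimal sketch of how several independent shuffles could be summarized, if run time allowed them. It assumes the FDR for one replicate is estimated as shuffled hits divided by real hits (capped at 1). The function name `empirical_fdr` and the counts in the example are hypothetical, for illustration only.

```python
import statistics


def empirical_fdr(real_hits, shuffled_hits):
    """Summarize per-replicate FDR estimates from independent shuffles.

    real_hits: # of elements predicted on the original DNA.
    shuffled_hits: list of # of elements predicted on each shuffled replicate.
    Returns (mean FDR, sample standard deviation across replicates).
    """
    if real_hits <= 0:
        raise ValueError("real_hits must be positive")
    # Assumption: hits on shuffled input stand in for false positives.
    per_replicate = [min(h / real_hits, 1.0) for h in shuffled_hits]
    mean_fdr = statistics.mean(per_replicate)
    # The spread is undefined for a single replicate, which is my concern.
    if len(per_replicate) > 1:
        spread = statistics.stdev(per_replicate)
    else:
        spread = float("nan")
    return mean_fdr, spread


# Hypothetical counts, not real results
mean_fdr, spread = empirical_fdr(200, [10, 14, 12, 8, 16])
print(f"FDR = {mean_fdr:.3f} +/- {spread:.3f}")
```

With a single shuffle the second value cannot be computed at all, so there is no way to say how far that one FDR might be from the average.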

Problem 2 : The observation I report in the math stack exchange post, that the # of elements follows a trend as the length of the sliding window is changed, makes me worry about using the 1MB recommended by the author. The FDR is lower at a sliding window length of 1MB than at 10bp, 50bp or 100bp, and I wonder what the 'valid' length for shuffling DNA would be. In other words, there are also biological criteria that need to be imposed so that the shuffle is biologically meaningful. With the lure of reporting a lower FDR, did the author incorrectly use 1MB for the sliding window within which the DNA is shuffled? Is the FDR actually higher, and should it be based on a sliding window length that is approximately the length of the header and trailer sequences?
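For clarity, this is roughly what I understand by shuffling within a window. It is a sketch under my own assumptions, not the author's code: I assume consecutive, non-overlapping windows, and the name `shuffle_in_windows` and the `seed` parameter are hypothetical.

```python
import random


def shuffle_in_windows(seq, window, seed=None):
    """Shuffle bases independently inside consecutive, non-overlapping windows.

    Base composition is preserved within every window, so a small window
    keeps local composition (e.g. GC-rich stretches) while a large window
    only keeps composition averaged over a long region.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    rng = random.Random(seed)  # seeded so a replicate can be reproduced
    out = []
    for start in range(0, len(seq), window):
        chunk = list(seq[start:start + window])
        rng.shuffle(chunk)  # in-place shuffle of this window only
        out.append("".join(chunk))
    return "".join(out)


# Compare a small and a large window on a toy sequence
toy = "ACGTACGTTTGACAGGGGCCCCAATT"
print(shuffle_in_windows(toy, 10, seed=1))
print(shuffle_in_windows(toy, len(toy), seed=1))
```

The window length is the only knob here, which is why I am unsure whether 1MB or something near the header and trailer length is the defensible choice.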

I do not know if problems 1 and 2 above are real or if I have imagined them. If they are real, then I am NOT tied to the idea of shuffling DNA. If there is any other way to assess and report the FDR that math / biology proficient Monks can think of, I am all eyes and ears. Thank you!

In reply to Re^2: Window size for shuffling DNA? by onlyIDleft
in thread Window size for shuffling DNA? by onlyIDleft
