in reply to Re^2: Window size for shuffling DNA?
in thread Window size for shuffling DNA?

Rather than "identify" true elements, I just want to report how many of them are likely false positives

An answer: there is no (none, zero, zilch) statistical basis for predicting the FDR, based upon your process, regardless of the window length.

Further, I do not believe that there ever could be any correlation between the length of the window in which you randomise, and any meaningful statistic about real-world DNA.

Basis of conclusion: Instinct. My gut feel for the requirements for Monte Carlo simulations to produce statistically valid results; having constructed and run many hundreds of such simulations over the years.

Caveat: Not one of my simulations had anything to do with DNA or genomics; and I know next to nothing about the subject.

You cannot draw a statistically valid conclusion based upon 1 (or even a few) random trials; when shuffling just 50 bytes of your sequences can have 1,267,650,600,228,229,401,496,703,205,376 possible outcomes.
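
For scale: that figure is 4**50 (= 2**100), i.e. the number of distinct 50-letter strings over the 4-letter A/C/G/T alphabet. A trivial sketch (mine, purely for illustration) reproduces it:

    # Sketch for illustration only: the quoted figure is 4**50, the number of
    # distinct 50-letter strings over the 4-letter DNA alphabet (A/C/G/T).
    use strict; use warnings;
    use Math::BigInt;

    my $outcomes = Math::BigInt->new(4)->bpow(50);
    print "$outcomes\n";    # prints 1267650600228229401496703205376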

However, if you want a mathematically sound, statistical assessment of your question, then you are going to have to describe the process in much more detail; and give far more accurate assessments of the ranges of the numbers involved. See below for some of the questions arising.


Warning: what follows may come across as "angry". It isn't. It's expressed this way to make a point.

How do you expect to get assistance, when you ask: a statistics question; of a bunch of programmers; and conceal everything in genomics lingo?

What the &**&&^% are:

You say "I DO supply the software with 2 separate libraries of LCVs one for the headers, another for the trailer sequences that are supposed to be 'bona fide' based on independent verification".

That is an almost entirely useless description:

Conclusion: based upon the information you've supplied so far; and my own experience of drawing conclusions based upon random simulations; I see no basis for any meaningful conclusions with regard to false discovery rates.

But:

  1. I've only understood about 70% of the information you supplied.
  2. You've only supplied about 10% of the information required to make a proper assessment of the process.

You would need to describe that 3rd-party process in detail: what are its inputs (a genome and 2 libraries; but how big, and what other constraints), and what are its outputs? The fact that your graph appears to tail off as the length of the window increases is, of itself, meaningless. It also appears to increase initially. Both could simply be artifacts of the particular set of randomisations that occurred in this run.

How many runs would be required to draw a conclusion? There is no way to determine that from the information you have provided so far.



Re^4: Window size for shuffling DNA?
by onlyIDleft (Scribe) on May 18, 2015 at 22:01 UTC

    FDR = False Discovery Rate = (# false positives / total # of reported elements) * 100%

    LCV = Local Combinational Variables; I had never heard of the term before. The author of the software has an earlier paper using it for some other bioinformatic purpose: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2761287/ Not sure if it is even directly relevant, but I am throwing it out there.

    In DNA sequence parlance, 1 letter = 1 base pair = 1bp (abbreviation)

    Therefore 1000bp = 1KiloBasePair = 1KB (abbreviation)

    Likewise 10^6bp = 1MegaBasePair = 1MB, and so on...

    pHMM = profile Hidden Markov Model, used to create probabilistic models of insertions, deletions and substitutions in protein or DNA multiple sequence alignments; more about this may be gleaned from http://en.wikipedia.org/wiki/Hidden_Markov_model However, this may be a distraction, since the software does NOT use pHMMs but LCVs - on which I cannot find any theory from just a Google search.

    Number of different header sequences is 304 in the head LCVs library

    Number of different trailer sequences is 576 in the tail LCVs library

    Length variation of header sequences in the head LCVs library: ~10 - 50

    Length variation of trailer sequences in the tail LCVs library: ~10 - 50

    In step 1, the 3rd-party software detects matches to the head LCVs, and then, separately, matches to the tail LCVs

    After step 1, the software, in step 2, joins these heads and tails into pairs. Default parameters limit the intervening length between any given head and tail to between 20 and 20,000 letters; in other words, if a head-and-tail combination is shorter than 20bp or longer than 20KB, it is ignored. Please note that the software is NOT looking, in any form or manner, for matches to ANY intervening sequence between the head and tail matches. It is ONLY looking for matches to the head and tail LCVs per se, then pairing them, and then imposing the size-range (20bp-20KB) filter to report the elements
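
    As I understand step 2, the pairing and size filter amount to something like the sketch below (this is NOT the 3rd-party tool's actual code; the match coordinates are made up for illustration):

        # Sketch only -- NOT the 3rd-party tool's code. Illustrates step 2 as I
        # understand it: pair every head match with every downstream tail match
        # whose separation (here taken as the intervening length) falls within
        # the default 20 bp .. 20 KB range.
        use strict; use warnings;

        my @head_hits = ( [ 1_000, 1_040 ], [ 55_000, 55_030 ] );   # made-up [start, end] from step 1
        my @tail_hits = ( [ 1_500, 1_525 ], [ 90_000, 90_045 ] );   # made-up [start, end] from step 1

        my @elements;
        for my $h ( @head_hits ) {
            for my $t ( @tail_hits ) {
                my $gap = $t->[0] - $h->[1];            # intervening length between head and tail
                next if $gap < 20 or $gap > 20_000;     # default size filter
                push @elements, [ $h->[0], $t->[1] ];   # report the paired element
            }
        }
        printf "reported %d candidate element(s)\n", scalar @elements;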

    IMO, the tailing-off of the graph is not meaningless, for the following reason: I ran these tests on randomized DNA sequences of completely different species (3 shown in the figure, 2 other species not in the figure), and all 5 show the exact same trend. It would be unlikely to see the exact same trend for all 5 species if this were merely a random artifact... So there is something going on that I don't understand, and maybe it is not easy to explain...
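
    For reference, the window-restricted shuffling itself (as I understand the procedure; this is a sketch, not the actual script that was used) boils down to shuffling each fixed-size chunk of the sequence independently, so that letters never leave their own window:

        # Sketch only, not the script actually used: shuffle a DNA string within
        # consecutive, non-overlapping windows of $win letters (my reading of the
        # "sliding window" within which shuffling is confined).
        use strict; use warnings;
        use List::Util 'shuffle';

        sub shuffle_in_windows {
            my ( $seq, $win ) = @_;
            my $out = '';
            for ( my $i = 0; $i < length $seq; $i += $win ) {
                my @bases = split //, substr( $seq, $i, $win );
                $out .= join '', shuffle @bases;
            }
            return $out;
        }

        my $toy = 'ACGTACGTACGTACGTACGT';             # made-up 20 bp sequence
        print shuffle_in_windows( $toy, 5 ), "\n";    # each 5-letter block shuffled in place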

      On your graph what does the Y-axis value represent? Detections; or false detections?

      If the latter, how do you make the determinations of what is a false detection rather than a detection?

      And what are the 'other' corresponding values for detections? (Ie. When the value for M.truncatula is ~2000; if that represents detections, what is the false detections value; or vice versa?)



        The Y-axis is the # of elements reported on a given genome that has been scrambled. The X-axis is the size of the sliding window within which this random shuffling was performed. The assumption made by the author, which I am also adopting (though I am not sure whether it is entirely correct), is that any element I discover on a scrambled genome has to be, by definition, a false positive.

        Conversely, the elements that I discover and report on the original, unshuffled genome, have to be, by definition, true positives

        The software author reported FDR simply by comparing the original vs. the shuffled genome, in terms of the # of elements reported in each case: FDR = (# elements in shuffled genome) / (# elements in original genome) * 100 (in %)

        My chart does NOT show the # of elements in the original genome without any DNA random shuffling. Those numbers are as follows:

        A. thaliana (original genome, no DNA shuffle) - 885 elements

        B. rapa (original genome, no DNA shuffle) - 3686 elements

        M. truncatula (original genome, no DNA shuffle) - 1808 elements

        As expected, these numbers above, for the unshuffled genomic DNA as input, are higher than the # of elements reported for the same genomes after random DNA shuffling (irrespective of the sliding window size). So, at least in this context, I am seeing what is 'expected': the shuffled genome serves as a negative control and yields fewer elements than the original genome.
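
        Putting the author's FDR formula together with the unshuffled counts above gives calculations of the following shape (a sketch only; the shuffled counts here are placeholders, since the real values are the Y-axis readings for a given window size):

            # Sketch: the author's FDR formula applied to the unshuffled counts above.
            # The shuffled counts are PLACEHOLDERS -- the real values are the Y-axis
            # readings from the chart and depend on the shuffling window size.
            use strict; use warnings;

            my %original = ( 'A. thaliana' => 885, 'B. rapa' => 3686, 'M. truncatula' => 1808 );
            my %shuffled = ( 'A. thaliana' => 100, 'B. rapa' => 400,  'M. truncatula' => 200 );   # placeholders

            for my $species ( sort keys %original ) {
                my $fdr = 100 * $shuffled{$species} / $original{$species};
                printf "%-14s FDR = %.1f%%\n", $species, $fdr;
            }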