in reply to Re^3: Window size for shuffling DNA?
in thread Window size for shuffling DNA?

FDR = False Discovery Rate = (# false positives / total # of detections) × 100%
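For example (illustrative numbers only, not from my data): 5 false positives among 250 total reported detections would give FDR = 5 / 250 × 100% = 2%.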

LCV = Local Combinational Variables; I had never heard of it before. The author of the software has an earlier paper using it for some other bioinformatic purpose: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2761287/ I am not sure it is even directly relevant, but I am throwing it out there.

In DNA sequence parlance, 1 letter = 1 base pair = 1 bp (abbreviation)

Therefore 1000 bp = 1 kilobase pair = 1 KB (abbreviation)

Likewise, 10^6 bp = 1 megabase pair = 1 MB, and so on...

pHMM = profile Hidden Markov Model, used to create probabilistic models of insertions, deletions and substitutions for protein or DNA multiple sequence alignments; more about this may be gleaned from http://en.wikipedia.org/wiki/Hidden_Markov_model However, this may be a distraction, since the software does NOT use pHMMs but LCVs, on which I cannot find any theory from just a Google search.

Number of different header sequences in the head LCVs library: 304

Number of different trailer sequences in the tail LCVs library: 576

Length variation of header sequences in the head LCVs library: ~10 - 50 bp

Length variation of trailer sequences in the tail LCVs library: ~10 - 50 bp

In step 1, the 3rd party software detects matches to the head LCVs and, separately, matches to the tail LCVs.

After step 1, the software, in step 2, joins these heads and tails into pairs. The default parameters limit the intervening length between any given head and tail to between 20 and 20,000 letters; in other words, head-and-tail combinations shorter than 20 bp or longer than 20 KB will be ignored. Please note that the software is NOT looking, in any form or manner, for matches to ANY intervening sequence between the head and tail matches. It is ONLY looking for matches to the head and tail LCVs per se, then pairing them, and then imposing the size-range (20 bp - 20 KB) filter to report the elements.
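To illustrate step 2, here is a minimal sketch of such a pairing-and-size-filter pass. This is NOT the 3rd party software's actual code: the match positions are made up, and I am assuming the intervening length is measured from the end of a head match to the start of a tail match.

    #!/usr/bin/perl
    # Hypothetical sketch of the pairing/size filter in step 2.
    use strict;
    use warnings;

    my @head_ends   = ( 100, 5_000, 40_000 );   # 3' ends of head-LCV matches (made-up)
    my @tail_starts = ( 150, 26_000, 41_000 );  # 5' starts of tail-LCV matches (made-up)

    my ( $MIN, $MAX ) = ( 20, 20_000 );         # default intervening-length limits

    for my $h (@head_ends) {
        for my $t (@tail_starts) {
            my $gap = $t - $h;                  # intervening length for this pairing
            next if $gap < $MIN or $gap > $MAX; # outside 20 bp .. 20 KB: ignored
            print "element: head ends at $h, tail starts at $t, gap $gap bp\n";
        }
    }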

IMO, the tailing-off of the graph is not meaningless, for the following reason: I ran these tests on randomized DNA sequences of completely different species (3 shown in the figure, 2 other species not shown), all 5 of which show the exact same trend. It would be unlikely to see the exact same trend in all 5 species if this itself were merely a random event... So there is something going on that I don't understand, and maybe it is not easy to explain...

Re^5: Window size for shuffling DNA?
by BrowserUk (Patriarch) on May 19, 2015 at 17:22 UTC

    On your graph what does the Y-axis value represent? Detections; or false detections?

    If the latter, how do you make the determinations of what is a false detection rather than a detection?

    And what are the 'other' corresponding values for detections? (Ie. When the value for M.truncatula is ~2000; if that represents detections, what is the false detections value; or vice versa?)


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      The Y-axis is the # of elements reported on a genome that has been scrambled. The X-axis is the size of the sliding window within which the random shuffling was performed. The assumption made here by the author, and which I am advancing (not sure if it is entirely correct, nevertheless), is that when I discover an element on a scrambled genome, it has to be, by definition, a false positive.

      Conversely, the elements that I discover and report on the original, unshuffled genome, have to be, by definition, true positives

      The software author reported FDR from just the comparison of the original vs. the scrambled genome, in terms of the # of elements reported in each case: FDR = (# elements in shuffled genome) / (# elements in original genome) × 100 (in %)
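      To make that arithmetic concrete, here is a minimal sketch of the calculation. The original-genome counts are the ones listed below; the shuffled-genome counts here are hypothetical placeholders, NOT my actual results.

          #!/usr/bin/perl
          # FDR as defined above: shuffled-genome count over original-genome count.
          use strict;
          use warnings;

          sub fdr_percent {
              my ( $n_shuffled, $n_original ) = @_;
              return 100 * $n_shuffled / $n_original;
          }

          # Shuffled counts (first argument) are made-up placeholders.
          printf "A. thaliana   FDR = %5.1f%%\n", fdr_percent( 300, 885 );
          printf "B. rapa       FDR = %5.1f%%\n", fdr_percent( 900, 3686 );
          printf "M. truncatula FDR = %5.1f%%\n", fdr_percent( 500, 1808 );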

      My chart does NOT show the # of elements in the original genome without any DNA random shuffling. Those numbers are as follows:

      A. thaliana (original genome, no DNA shuffle) - 885 elements

      B. rapa (original genome, no DNA shuffle) - 3686 elements

      M. truncatula (original genome, no DNA shuffle) - 1808 elements

      As expected, these numbers above, for the unshuffled genomic DNA as input, are higher than the # of elements for the same genomes after random DNA shuffling (irrespective of the sliding window size). So, at least in this context, I am seeing what is 'expected': the shuffled genome serves as a negative control, yielding fewer elements than the original, unshuffled genome.

        The assumption made here by the author, and which I am advancing (not sure if it is entirely correct, nevertheless), is that when I discover an element on a scrambled genome, it has to be, by definition, a false positive.

        No quibble with that. The result of shuffling the DNA is that it is no longer DNA. Anything detected is just random chance.

        But combining the number of hits in real DNA samples with numbers of hits found by chance in non-DNA samples, in a mathematical equation (your %FDR), is extremely dubious; if not just outright bogus.

        At the very best, all it gives you is some measure of the possibility that, of the hits you find in the real DNA, some percentage might be down to chance. But it doesn't tell you whether they are down to chance; and even if some of them are, it doesn't give you any information with which to determine which ones are down to chance.

        As such, it is a useless statistic. It's like knowing that any given pick of 6 numbers in the (UK) lottery has a 1 in 53.66 chance of picking up some prize: it doesn't help you pick a winning combination; much less pick one that will win a major prize.

        So, at least in this context, I am seeing what is 'expected': the shuffled genome serves as a negative control, yielding fewer elements than the original, unshuffled genome.

        Your original question asks if using a larger window when shuffling your DNA samples reduces the chances of false positives; as appeared to be indicated by your graph.

        But for that to be true, the random state of your non-DNA sample would have to somehow influence the hits found in your unshuffled, real-DNA sample. And that simply cannot be. So, the answer must be: NO!

        The only effect that using a larger window might have is that, by shuffling the characters over a wider base, it might(*) be less likely to randomly produce matches to your header/trailer libraries. But even if it does, that tells you exactly nothing about whether the hits found in the unshuffled, real-DNA sample are good or bad; because the two have literally nothing in common.

        You might just as well start with the digits of Pi, map them to ACGT, and search the result for matches, for all the bearing the results -- whatever they might come out to be -- will have upon the efficacy of any matches you find in your real DNA samples. Ie. none whatsoever.

        *DNA is not random. I ran a crude process on a copy of the full human genome I already had on my hard disk: I scanned the 2,861,343,839 (non-N) characters and collated all the unique 16-base subsequences therein. Amongst the 2,861,343,824 subsequences in the file (drawn from the 4 billion (2^32) possible 16-base subsequences), only 1,130,866,232 unique subsequences actually appear.

        Of those, 633,492,754 appear exactly once; another 188,580,306 appear less than 8 times; and 8,793,172 less than 256 times. The remaining 236,135 subsequences appear more than 256 times.

        The most frequent subsequences, 'aaaaaaaaaaaaaaaa' & 'tttttttttttttttt' appear nearly 1 million times each. These are the frequencies of the next 30 most frequent:

        332362 328795 327795 324360 203697 203018 201263 199475 199235 198964 198412 197732 197184 196340 195806 195132 194474 194028 193476 192956 191843 191242 191019 190768 190628 190278 182911 182452 180801 179857
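        For reference, a simplified, hash-based sketch of that kind of 16-mer tally. This is not the code I actually ran: a whole-genome scan of ~2.8 billion bases would need a far more compact counter (e.g. a packed array with 2**32 slots) than a Perl hash.

            #!/usr/bin/perl
            # Tally every overlapping 16-base subsequence of a (small) input.
            use strict;
            use warnings;

            my $seq = 'acgtacgtacgtacgtacgtaaaaaaaaaaaaaaaaaa';  # toy input, not real DNA
            my $K   = 16;
            my %count;

            $count{ substr( $seq, $_, $K ) }++ for 0 .. length( $seq ) - $K;

            my $unique  = keys %count;                             # distinct 16-mers seen
            my $singles = grep { $count{$_} == 1 } keys %count;    # those seen exactly once
            print "unique 16-mers: $unique; appearing exactly once: $singles\n";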

        As these counts show, real DNA is heavily biased when compared to purely random sequence; so shuffling real DNA is a good way to produce a DNA-like overall mix of random bases.

        But once shuffled, it has no relationship to the real DNA it was derived from; and thus, there can be no correlation between any data derived by comparing the two.

        As for the size of the "sliding window" over which you shuffle the bases: there is some visual (but uncorroborated) evidence that real DNA has some locality bias also. That is, the same subsequences, if repeated, tend to appear in relatively close proximity to their duplicates. On that basis, it is probable that the effect of the larger window is to 'more thoroughly mix' the bases, and thus the result tends to be less DNA-like; with the knock-on effect that you are less likely to find matches to subsequences drawn from real DNA. That could explain the graph you posted elsewhere.
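        A minimal sketch of one plausible windowed shuffle, assuming the "window" simply tiles the sequence and each tile is shuffled independently (the 3rd party tool may well define its sliding window differently):

            #!/usr/bin/perl
            # Shuffle DNA within fixed-size windows: split the sequence into
            # consecutive chunks of $window bases and shuffle each in place.
            use strict;
            use warnings;
            use List::Util 'shuffle';

            sub window_shuffle {
                my ( $seq, $window ) = @_;
                my $out = '';
                for ( my $i = 0; $i < length $seq; $i += $window ) {
                    $out .= join '', shuffle split //, substr( $seq, $i, $window );
                }
                return $out;
            }

            my $dna = 'ACGTACGTACGTACGTACGT';
            print window_shuffle( $dna, 5 ),  "\n";  # shuffle within 5-base windows
            print window_shuffle( $dna, 20 ), "\n";  # single window: full shuffle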

        But, and I cannot emphasise this enough: regardless of how well you mix the bases, any matches (or lack thereof) found in the shuffled DNA have exactly no correlation with, nor influence upon, nor any predictive or diagnostic utility when compared to, the matches found in the real DNA.

        I have no knowledge of the experience/prowess/standing of the author of the paper you cited; nor do I understand its contents; but I am really very sure that combining numbers derived from real & shuffled DNA into a single equation is completely bogus math.

