in reply to Re^8: Last comment (and nail in the coffin of) of S::CS :)
in thread Shuffling CODONS

*I'm not at all convinced that these two set of values are "more or less the same":*

    pvalues=(0.4869334,0.4961692,0.8251134,0.2657584,0.84692,0.1296479,0.504212,0.9028209,0.8082847,0.2999607,0.154672,0.1660518,0.5143663,0.8120685,0.4452244,0.6561128,0.6123136,0.6994308,0.9302561,0.4757345)
    pvalues=(0.25,0.25,0.75,0.25,0.75,0.1,0.5,0.9,0.75,0.25,0.1,0.1,0.5,0.75,0.25,0.5,0.5,0.5,0.9,0.25)

Right, so the p-values are sometimes comparable and sometimes not. R must be calculating p-values using some kind of Monte Carlo simulation; that's why they have the granularity you observed (a ratio of successes/total_trials, or something similar). Statistics::ChiSquare, on the other hand, has a lookup table with a column for each p-value (the 100%, 95%, etc.) that someone thought useful to include. So the latter must return a p-value *range* into which the (correctly, as far as I can see from skimming the output) calculated statistic falls. If the range in the table is small (e.g. 100-99) then the p-values will more or less coincide (probably what I posted earlier).
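To make the table-versus-continuous distinction concrete, here is a minimal Python sketch (illustrative only, not S::CS code; the cutoffs are the standard chi-square critical values for 1 degree of freedom) of why a lookup-table implementation can only ever report a p-value range:

```python
# Standard chi-square critical values for 1 degree of freedom.
# Assumption: a table-driven module keys on a handful of these.
CRITICAL_DF1 = [          # (p, critical value), ascending p
    (0.01, 6.635),
    (0.05, 3.841),
    (0.10, 2.706),
    (0.25, 1.323),
    (0.50, 0.455),
]

def p_value_range(statistic):
    """Return (low, high) bounds on the p-value of a df=1 statistic:
    a bracketing interval is all a lookup table can give you."""
    low = 0.0
    for p, crit in CRITICAL_DF1:
        if statistic >= crit:
            return (low, p)
        low = p
    return (low, 1.0)

print(p_value_range(3.9))   # → (0.01, 0.05), i.e. 0.01 < p <= 0.05
```

A distribution-based method (like R's) would return a single continuous value somewhere inside that interval, which is exactly why the two outputs only coincide when the table's interval happens to be narrow.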

If, however, the range is big (e.g. 50-25) then they will coincide only when R's calculated p-value is near the upper end of the range. The cure for this is to use R's method for calculating p-values, which I guess is state-of-the-art. I don't think anyone wants to open that can of worms and start translating R recipes into pure Perl. So, for me, using Statistics::R once in a while for some obscure statistical methods is OK. For the record, all data was created in Perl and then sent to R just for the chi-square test.

I agree with you also that chi-square may not be the perfect tool for this particular situation. It may well be suitable (or at least accepted by statisticians) for others. In the end, one has to decide what they want out of a shuffle: to destroy patterns in the sequence of the original dataset? To create random datasets (as you do, where the aim is for all items to be equally represented)? Or to split a mother set into smaller sets whose average properties approach the average of the mother set - for medical trials, for example. Once the purpose is clear, a test can be chosen accordingly.

Being able to work out the efficacy of your shuffle is a useful thing but many don't bother with it. Let me know of your progress.

Oh, and thank you for providing some interesting diversion for a boring Sunday:)

likewise, though I did have a boring swim in between


Re^10: Last comment (and nail in the coffin of) of S::CS :)
by BrowserUk (Patriarch) on Jun 12, 2018 at 13:07 UTC
    I agree with you also that chi-square may not be the perfect tool for this particular situation. It may well be (or accepted by statisticians) for others. At the end, one has to decide what they want out of a shuffle:

    Okay, I did a little more exploration of S::CS, and I have a few questions for you. (Sorry that what follows reads a bit like a bad marketing survey :) ).

    Do you agree that ...

    • ... the Mersenne Twister algorithm is an excellent example of a PRNG?
    • ... Math::Random::MT is a correct implementation of the Mersenne Twister algorithm?
    • ... the Fisher-Yates algorithm is proven to produce fair shuffles?
    • ... my implementation (the one we've been using) is a correct implementation of that algorithm?

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
    In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

      For the second question (is Math::Random::MT a correct implementation of the Mersenne Twister algorithm?), I have no idea. But its documentation does say that

      Based on the C implementation of MT19937 Copyright (C) 1997 - 2002, Makoto Matsumoto and Takuji Nishimura

      If we interpret "Based" as "exact translation", then, unless there is a bug in that code, it must be correctly implemented. I guess.

      For the third question (is the Fisher-Yates algorithm proven to produce fair shuffles?), I quote from Wikipedia's entry on FY:

      Care must be taken when implementing the Fisher–Yates shuffle, both in the implementation of the algorithm itself and in the generation of the random numbers it is built on, otherwise the results may show detectable bias. A number of common sources of bias have been listed below...

      Assuming code is correct, then it all boils down to the RNG.

      If this source, A Brief History of Random Numbers, is right, then the first PRNG came in 1946, long after FY (1938). Did Fisher and Yates know about biased RNGs then, or did RNG bias only become an issue once PRNGs were introduced? I don't know. But it is quite reasonable to ask any maker of a stochastic algorithm how their algorithm performs given a few different PRNGs.
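As an aside on how much the random-index choice matters even with a perfect RNG: the classic *biased* shuffle draws j from the whole array at every step instead of from i..n-1. A small Python sketch (illustrative only, unrelated to any code in this thread) shows by exhaustive enumeration that this cannot be fair on 3 elements:

```python
import itertools
import collections

def naive_counts(n=3):
    """Enumerate every outcome of the biased shuffle that swaps a[i]
    with a[j] for j drawn from the WHOLE array (0..n-1) at each step."""
    counts = collections.Counter()
    for choices in itertools.product(range(n), repeat=n):
        a = list(range(n))
        for i, j in enumerate(choices):
            a[i], a[j] = a[j], a[i]
        counts[tuple(a)] += 1
    return counts

# 3^3 = 27 equally likely choice-sequences cannot map evenly onto
# 3! = 6 permutations, so some permutations must be more likely:
print(sorted(naive_counts().values()))  # → [4, 4, 4, 5, 5, 5]
```

27 is not divisible by 6, so no amount of RNG quality rescues this variant; the bias is baked into the index range, which is exactly the kind of implementation pitfall the Wikipedia warning is about.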

      For the fourth question (is the implementation we've been using a correct implementation of that algorithm?), here is Wikipedia's algorithm with the later enhancements:

      -- To shuffle an array a of n elements (indices 0..n-1):
      for i from 0 to n-2 do
          j <- random integer such that i <= j < n
          exchange a[i] and a[j]

      First, your "create random integer" part of the code:

      $n = scalar(@aliased);
      $a = $_ + rand @aliased - $_              # original code
         = i + rand($n - i)                     # which becomes...
         = i + choiceof(0, 1, ..., $n-i-1)      # perl's rand(10) dumps 0-9 inclusive
         = random integer such that i <= j <= i + $n - i - 1
         = random integer such that i <= j <= $n - 1
         = random integer such that i <= j < $n

      I think that validates OK.

      Next the loop:

      for 0 .. $#aliased {
          $a = $_ + rand @aliased - $_;
          $b = $aliased[ $_ ],
          $aliased[ $_ ] = $aliased[ $a ],
          $aliased[ $a ] = $b
      }

      becomes

      for i from 0 to n-1 do
          j <- random integer such that i <= j < n
          exchange a[i] and a[j]

      The only discrepancy I can spot is that your loop runs up to (n-1) instead of (n-2). Maybe I got something wrong.
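For what it's worth, the two loop bounds turn out to be equivalent: when i = n-1 the only possible j is n-1, so the extra iteration is a self-swap. A small Python sketch (illustrative, not the Perl code above) checks both variants by exhaustive enumeration:

```python
import itertools
import collections

def fy_counts(n, last):
    """Exhaustively enumerate Fisher-Yates outcomes on range(n).
    At step i, j is drawn uniformly from i..n-1; `last` is the final
    value of i (n-2 per Wikipedia, n-1 in the Perl loop)."""
    counts = collections.Counter()
    ranges = [range(i, n) for i in range(last + 1)]
    for choices in itertools.product(*ranges):
        a = list(range(n))
        for i, j in enumerate(choices):
            a[i], a[j] = a[j], a[i]
        counts[tuple(a)] += 1
    return counts

# The i == n-1 iteration can only pick j == n-1 (a self-swap),
# so both loop bounds yield a perfectly uniform shuffle:
print(sorted(fy_counts(3, 1).values()))  # → [1, 1, 1, 1, 1, 1]
print(sorted(fy_counts(3, 2).values()))  # → [1, 1, 1, 1, 1, 1]
```

Every choice-sequence maps to a distinct permutation in both cases, so stopping at n-2 is merely a micro-optimisation, not a correctness issue.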

      Regarding the first question now (is the Mersenne Twister algorithm an excellent example of a PRNG?): there are many situations in which one PRNG is better than another. For example, Math::Random::MT's overview states:

      This algorithm has a very uniform distribution and is good for modelling purposes but do not use it for cryptography.

      I have concocted three more tests, in addition to the autocorrelation coefficient (already mentioned), chi-squared (already mentioned) and, of course, the evenness of all outcomes in the long run:

      Random walk test: use the PRNG to do a random walk; if it is any good then the theoretical convergence of the random walk will be confirmed. See random_walk_test.pl (it uses Statistics::Test::RandomWalk).
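The random_walk_test.pl script itself is not reproduced here; a rough Python sketch of the same idea (walk length, walk count and seed are my own choices) is that the end-point RMS displacement of a fair ±1 walk should approach sqrt(steps):

```python
import math
import random

def random_walk_rms(steps, walks, rng=random.random):
    """End-point RMS displacement of `walks` fair +/-1 random walks;
    theory predicts roughly sqrt(steps) if the RNG is unbiased."""
    total = 0
    for _ in range(walks):
        pos = sum(1 if rng() < 0.5 else -1 for _ in range(steps))
        total += pos * pos
    return math.sqrt(total / walks)

random.seed(42)                       # fixed seed for a reproducible run
rms = random_walk_rms(steps=1000, walks=2000)
print(rms)                            # should land near sqrt(1000) ~ 31.6
```

Any `rng` callable returning floats in [0,1) can be plugged in, which is what makes this shape of test handy for comparing PRNGs side by side.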

      Compression test: gzip a random sequence of integers and see what compression ratio is achieved. A truly random sequence should compress less than one which contains patterns. See compression_test.pl.
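Again, the Perl script is not shown; here is a minimal Python sketch of the idea (using raw random bytes and zlib, so the ratios will differ from the percentages reported for the original scripts):

```python
import random
import zlib

def compression_ratio(data: bytes) -> float:
    """compressed size / original size; near 1.0 means incompressible."""
    return len(zlib.compress(data, 9)) / len(data)

random.seed(1)
random_bytes = bytes(random.randrange(256) for _ in range(100_000))
patterned = bytes(i % 256 for i in range(100_000))  # 0,1,...,255 repeating

print(compression_ratio(random_bytes))  # close to 1.0
print(compression_ratio(patterned))     # tiny: patterns compress well
```

A good RNG's output should sit at the incompressible end; a sequence whose ratio drops well below 1.0 is leaking structure.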

      Transition matrix test: build the transition matrix for a sequence of random integers with a specified lag, i.e. count the occurrences of (rand[i], rand[i+lag]) for each i. For lag=1 we count the transitions between neighbours, e.g. if the current random number is 4 and the next random number is 5, then increment counts[4][5]. For a truly random sequence of integers we expect all transitions to have equal probabilities (counts) - well, they ain't. See transition_matrix_test.pl.
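A minimal Python sketch of the transition-count idea (my own parameter choices, not transition_matrix_test.pl):

```python
import random
from collections import Counter

def transition_counts(seq, lag=1):
    """Count occurrences of the pair (seq[i], seq[i+lag]) for every i."""
    return Counter(zip(seq, seq[lag:]))

random.seed(7)
k = 8                                    # integers 0..k-1 (my choice)
seq = [random.randrange(k) for _ in range(200_000)]
counts = transition_counts(seq)

expected = (len(seq) - 1) / (k * k)      # uniform expectation per pair
mean_diff = (sum(abs(c - expected) for c in counts.values())
             / len(counts) / expected * 100)
print(f"mean %diff from expected count: {mean_diff:.2f} %")
```

Even for a fair RNG the mean deviation is not zero - sampling noise alone puts it around 1/sqrt(expected count per cell) - so the interesting comparison is between candidate RNGs at the same sample size, as done below.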

      Some results:

      ./random_walk_test.pl : done 10000000 trials with 10 bins, here is a summary:
      ./random_walk_test.pl : RandomWalk test : Math::Random::MT's rand() with mean %diff from expected to be 0.022 %
      ./random_walk_test.pl : RandomWalk test : Perl's rand() with mean %diff from expected to be 0.019 %
      ./random_walk_test.pl : best results for RandomWalk test : Perl's rand() with mean %diff from expected to be 0.019 %
      ./random_walk_test.pl : end.
      ./compression_test.pl : done 100 repeats each time compressing a string of random integers of length 1000000, here is a summary:
      ./compression_test.pl : Gzip compression test : Math::Random::MT's rand() with mean %diff from expected to be 72.231 %
      ./compression_test.pl : Gzip compression test : Perl's rand() with mean %diff from expected to be 72.232 %
      ./compression_test.pl : best results for Gzip compression test : Perl's rand() with mean %diff from expected to be 72.232 %
      ./compression_test.pl : end.
      ./transition_matrix_test.pl : done 2000000 repeats, with random integers between 0-30 inclusive and a lag of 1, here is a summary:
      ./transition_matrix_test.pl : transition matrix test : Perl's rand() with mean %diff from expected count to be 1.796 %
      ./transition_matrix_test.pl : transition matrix test : Math::Random::MT's rand() with mean %diff from expected count to be 1.797 %
      ./transition_matrix_test.pl : best results for transition matrix test : Perl's rand() with mean %diff from expected count to be 1.796 %
      ./transition_matrix_test.pl : end.

      For these 3 tests the 2 candidates (Math::Random::MT and Perl's builtin rand()) perform the same IN THE LONG RUN. So, Perl's rand() is a decent choice compared to the alternative (MT), but there is room for improvement - re: the transition matrix test.

      bw, bliako