Okay. I ran your code again and got:
C:\test>junk123
best_shuffle : There's a <1% chance that this data is random.
good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
bad_shuffle : There's a >5% chance, and a <10% chance, that this data is random.

That's not right!

So, I thought about my example code and looked at what it was intended to demonstrate.

Namely that, despite the use of a completely bogus rand() function, a Fisher-Yates shuffle would still operate and produce results showing:

  1. That all possible shuffles of the data were being produced. I chose to shuffle 4 values because the 24 possible results fit on a screen and are simple to verify manually.
  2. That they were produced with (approximately) the same frequency. I.e. the number of times each possible shuffle was produced was approximately equal, at approximately 1/24th of the total runs.
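
For illustration, the two checks above can be sketched roughly as follows. This is not the original example code: the names fisher_yates and tally_shuffles are hypothetical, and Perl's built-in rand stands in for whichever random source is under test (the random source is passed in as a code ref so a bogus one can be swapped in).

```perl
use strict;
use warnings;

# Fisher-Yates shuffle, in place, using a caller-supplied random source.
# $rand is a code ref (an assumption of this sketch) that takes $n and
# returns a number in the range [0, $n).
sub fisher_yates {
    my ( $array, $rand ) = @_;
    for my $i ( reverse 1 .. $#$array ) {
        my $j = int( $rand->( $i + 1 ) );    # 0 <= $j <= $i
        @$array[ $i, $j ] = @$array[ $j, $i ];
    }
    return $array;
}

# Shuffle 1..$size $runs times and count how often each ordering appears.
sub tally_shuffles {
    my ( $size, $runs, $rand ) = @_;
    my %count;
    for ( 1 .. $runs ) {
        my @vals = ( 1 .. $size );
        fisher_yates( \@vals, $rand );
        $count{"@vals"}++;
    }
    return \%count;
}

my $runs  = 24_000;
my $count = tally_shuffles( 4, $runs, sub { rand $_[0] } );

# Check 1: how many of the 24 possible orderings were produced at all.
printf "%d distinct orderings seen (24 possible)\n", scalar keys %$count;

# Check 2: each ordering's count next to the expected runs/24.
printf "%-8s %6d  (expected about %d)\n", $_, $count->{$_}, $runs / 24
    for sort keys %$count;
```

With only 24 rows of output, both properties can be verified by eye, which is the reason for choosing 4 values.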

In that respect, it served its purpose.

But if you are going to formally test a shuffle, using only 4-value arrays and 1e6 iterations probably isn't the ideal scenario to test.

To that end I tweaked your code to allow me to adjust both parameters from the command line:

our $NUMTESTS //= 1e6;
our $ASIZE    //= 4;
...
my @vals = ( 1..$ASIZE );

And, given that Chi2 is generally used to determine whether a (smallish) sample is representative of a large and thus unknown population, I tried using a (moderately) larger array:

C:\test>junk123 -ASIZE=40 -N=1e5
best_shuffle : I can't handle 100000 choices without a better table.
good_shuffle : I can't handle 100000 choices without a better table.
bad_shuffle : I can't handle 100000 choices without a better table.

C:\test>junk123 -ASIZE=40 -N=1e4
best_shuffle : I can't handle 10000 choices without a better table.
good_shuffle : I can't handle 10000 choices without a better table.

C:\test>junk123 -ASIZE=40 -N=1e3
best_shuffle : I can't handle 1000 choices without a better table.
good_shuffle : I can't handle 1000 choices without a better table.
bad_shuffle : I can't handle 1000 choices without a better table.

C:\test>junk123 -ASIZE=40 -N=1e2
best_shuffle : There's a >99.5% chance, and a <100% chance, that this data is random.
good_shuffle : There's a >99.5% chance, and a <100% chance, that this data is random.
bad_shuffle : There's a >99.5% chance, and a <100% chance, that this data is random.

And once I found a sample size for that larger array that the module could handle, I did a few "identical" runs:

C:\test>junk123 -ASIZE=40 -N=1e2
best_shuffle : There's a <1% chance that this data is random.
good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
bad_shuffle : There's a >5% chance, and a <10% chance, that this data is random.

C:\test>junk123 -ASIZE=40 -N=1e2
best_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
bad_shuffle : There's a <1% chance that this data is random.

C:\test>junk123 -ASIZE=40 -N=1e2
best_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
bad_shuffle : There's a >10% chance, and a <25% chance, that this data is random.

Hm. Not exactly confidence inspiring.

Let's try some middle ground:

C:\test>junk123 -ASIZE=11 -N=1e5
best_shuffle : There's a >75% chance, and a <90% chance, that this data is random.
good_shuffle : There's a >75% chance, and a <90% chance, that this data is random.
bad_shuffle : There's a <1% chance that this data is random.

C:\test>junk123 -ASIZE=11 -N=1e5
best_shuffle : There's a >25% chance, and a <50% chance, that this data is random.
good_shuffle : There's a >1% chance, and a <5% chance, that this data is random.
bad_shuffle : There's a <1% chance that this data is random.

C:\test>junk123 -ASIZE=11 -N=1e4
best_shuffle : There's a >75% chance, and a <90% chance, that this data is random.
good_shuffle : There's a >1% chance, and a <5% chance, that this data is random.
bad_shuffle : There's a >75% chance, and a <90% chance, that this data is random.

C:\test>junk123 -ASIZE=11 -N=1e4
best_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
good_shuffle : There's a >50% chance, and a <75% chance, that this data is random.
bad_shuffle : There's a <1% chance that this data is random.

C:\test>junk123 -ASIZE=11 -N=1e4
best_shuffle : There's a >10% chance, and a <25% chance, that this data is random.
good_shuffle : There's a >5% chance, and a <10% chance, that this data is random.
bad_shuffle : There's a <1% chance that this data is random.

C:\test>junk123 -ASIZE=11 -N=1e2
best_shuffle : There's a >10% chance, and a <25% chance, that this data is random.
good_shuffle : There's a >75% chance, and a <90% chance, that this data is random.
bad_shuffle : There's a <1% chance that this data is random.

C:\test>junk123 -ASIZE=11 -N=1e2
best_shuffle : There's a >1% chance, and a <5% chance, that this data is random.
good_shuffle : There's a >90% chance, and a <95% chance, that this data is random.
bad_shuffle : There's a >1% chance, and a <5% chance, that this data is random.

Sure, it's guessing that the known, deliberately really bad rand is producing poor results most of the time, but it's also making the same guess about the known good rand with surprisingly high frequency.

Two possibilities:

  1. the test is implemented badly;
  2. it is the wrong test for this kind of data.

I suspect a little of both is at work here.

I'm far from an expert on stats, but I don't believe that Chi2 is the right test for the kind of samples this produces; and I cannot see any reference to Yates' correction in the module.
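
For anyone unfamiliar with the terms, here is a rough sketch of what a Chi2 statistic against a uniform expectation looks like, with and without Yates' continuity correction. It is not the module's code: chi_squared_uniform is a hypothetical name, the counts are made up for illustration, and clamping the corrected deviation at zero is a choice made for this sketch. Yates' correction is usually described for the one-degree-of-freedom case (two categories, or a 2x2 table), so whether it applies to the shuffle counts here is an open question rather than something this shows.

```perl
use strict;
use warnings;

# Pearson's chi-squared statistic for observed counts against a uniform
# expectation. With $yates true, 0.5 is subtracted from each absolute
# deviation before squaring (Yates' continuity correction); the deviation
# is clamped at zero, which is a choice of this sketch.
sub chi_squared_uniform {
    my ( $observed, $yates ) = @_;

    my $total = 0;
    $total += $_ for @$observed;
    my $expected = $total / @$observed;    # same expectation for every cell

    my $chi2 = 0;
    for my $o (@$observed) {
        my $dev = abs( $o - $expected );
        if ($yates) {
            $dev -= 0.5;
            $dev = 0 if $dev < 0;
        }
        $chi2 += $dev**2 / $expected;
    }
    return $chi2;
}

# Illustrative counts only: two categories, so one degree of freedom.
my @counts = ( 48, 52 );
printf "plain: %.2f  Yates: %.2f  (df = %d)\n",
    chi_squared_uniform( \@counts ),
    chi_squared_uniform( \@counts, 1 ),
    @counts - 1;
```

The statistic still has to be compared against a Chi2 table for the matching degrees of freedom to get a probability band like the ones printed above.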

More later.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". The enemy of (IT) success is complexity.
In the absence of evidence, opinion is indistinguishable from prejudice. Suck that fhit

In reply to Re^4: Shuffling CODONS by BrowserUk
in thread Shuffling CODONS by WouterVG
