I recently uploaded
Net::Random to the CPAN. It gathers data from a couple of online sources of truly random data (which I trust to really *be* random, that's not the issue here), and uses that to generate random numbers in the user's chosen range. For instance, you might want a bunch of random 0s and 1s to simulate tossing a coin, or random numbers from 1 to 6 to simulate a die roll.
Given that I trust the original data to be random, I still need to be sure that what I'm doing to the data isn't biassing it.
Such bias could be introduced in various ways, the two I can think of off the top of my head are:
- my algorithm sucks
- an off-by-one error
but there are no doubt other ways I could screw up. The whole point of testing is that I don't need to know in advance how I might have screwed up, the tests just show that I *have* screwed up.
The question is, then, how to test that my output data is nice and random? I initially thought of using Jon Orwant's Statistics::ChiSquared module, but that has a couple of big drawbacks:
- it thinks a coin that throws 500 heads followed by 500 tails is just fine and dandy;
- it's limited to 21 discrete values because of the way its implemented
The second of those is a headache that can be worked around. The first, however, is a showstopper. That test can't detect certain types of obvious bias. So, what I'm looking for is a module that:
- can determine whether data is evenly and randomly distributed across its range and is equally evenly distributed regardless of which part of the sample i look at (ie the first 20 values should be just as random as the next 100); and
- can determine whether the data is at all predictable (ie can it detect that if the die rolls a 1 it's likely to roll a 4 three rolls later, or if it rolls a 1 it won't roll a 1 next time)
I'm not aware of anything on CPAN that can do that. An alternative would be - and we can do this because I'm only concerned about whether *I* am introducing bias, not with whether the data is biassed - to check that the distribution of my results is the same as the distribution of the original data. But I'm not aware of anything to do that either.
So, can anyone point me at any appropriate modules? Or at an algorithm that I could turn into a module?
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
| |
For: |
|
Use: |
| & | | & |
| < | | < |
| > | | > |
| [ | | [ |
| ] | | ] |
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.