http://qs1969.pair.com?node_id=301587

DrHyde has asked for the wisdom of the Perl Monks concerning the following question:

I recently uploaded Net::Random to the CPAN. It gathers data from a couple of online sources of truly random data (which I trust to really *be* random, that's not the issue here), and uses that to generate random numbers in the user's chosen range. For instance, you might want a bunch of random 0s and 1s to simulate tossing a coin, or random numbers from 1 to 6 to simulate a die roll.

Given that I trust the original data to be random, I still need to be sure that what I'm doing to the data isn't biassing it.

Such bias could be introduced in various ways. The two I can think of off the top of my head are:
  • my algorithm sucks
  • an off-by-one error
but there are no doubt other ways I could screw up. The whole point of testing is that I don't need to know in advance how I might have screwed up; the tests just show that I *have* screwed up.

The question is, then, how to test that my output data is nice and random? I initially thought of using Jon Orwant's Statistics::ChiSquare module, but that has a couple of big drawbacks:

  • it thinks a coin that throws 500 heads followed by 500 tails is just fine and dandy;
  • it's limited to 21 discrete values because of the way it's implemented.
The second of those is a headache that can be worked around. The first, however, is a showstopper. That test can't detect certain types of obvious bias. So, what I'm looking for is a module that:
  • can determine whether data is evenly and randomly distributed across its range, and is equally evenly distributed regardless of which part of the sample I look at (i.e. the first 20 values should be just as random as the next 100); and
  • can determine whether the data is at all predictable (i.e. can it detect that if the die rolls a 1 it's likely to roll a 4 three rolls later, or that if it rolls a 1 it won't roll a 1 next time).
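To make those two requirements concrete, here is a rough sketch (in Python rather than Perl, purely for illustration) of the sort of checks I have in mind. The function names are my own invention, not anything from an existing module. It shows a plain frequency chi-square giving the 500-heads-then-500-tails coin a perfect score, while a block-by-block check and a check on pairs of values a fixed lag apart both light up. Note that overlapping pairs are not independent, so the pair statistic should be read as a rough indicator rather than something to look up directly in a chi-square table.

```python
from collections import Counter

def chi_square_stat(samples, values):
    """Pearson chi-square statistic against a uniform expectation."""
    counts = Counter(samples)
    expected = len(samples) / len(values)
    return sum((counts[v] - expected) ** 2 / expected for v in values)

def block_chi_square_stats(samples, values, block_size):
    """Frequency statistic for each consecutive full block of the sample.

    A sequence that is only uniform "on average" shows large values here."""
    return [chi_square_stat(samples[i:i + block_size], values)
            for i in range(0, len(samples) - block_size + 1, block_size)]

def pair_chi_square_stat(samples, values, lag=1):
    """Chi-square-style statistic over (x[i], x[i + lag]) pairs.

    If values `lag` positions apart are independent and uniform, every
    ordered pair should turn up about equally often."""
    pairs = list(zip(samples, samples[lag:]))
    counts = Counter(pairs)
    expected = len(pairs) / (len(values) ** 2)
    return sum((counts[(a, b)] - expected) ** 2 / expected
               for a in values for b in values)

# The pathological coin: 500 heads followed by 500 tails.
faces = ["H", "T"]
coin = ["H"] * 500 + ["T"] * 500

freq_stat = chi_square_stat(coin, faces)               # overall frequencies only
block_stats = block_chi_square_stats(coin, faces, 250)  # one value per block
pair_stat = pair_chi_square_stat(coin, faces, lag=1)    # successive-value pairs

print("overall:", freq_stat)
print("per block:", block_stats)
print("lag-1 pairs:", pair_stat)
```

The overall statistic comes out as exactly zero because the head and tail counts match the expectation perfectly, which is precisely the blind spot described above. Setting `lag=3` would be the analogue of asking whether a 1 makes a 4 more likely three rolls later.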

I'm not aware of anything on CPAN that can do that. An alternative would be - and we can do this because I'm only concerned about whether *I* am introducing bias, not with whether the data is biassed - to check that the distribution of my results is the same as the distribution of the original data. But I'm not aware of anything to do that either.
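One way that alternative might look, sketched in Python for illustration only: feed the transformation an input whose distribution is known exactly (every byte value once) and check that the output is just as even. Both mapping functions below are hypothetical stand-ins, not what Net::Random actually does. The modulo version is there only as an example of an algorithm that would fail such a check.

```python
from collections import Counter

def naive_map(byte, lo, hi):
    """Hypothetical mapping: fold a byte into [lo, hi] with modulo."""
    return lo + byte % (hi - lo + 1)

def rejection_map(byte, lo, hi):
    """Hypothetical mapping: drop bytes that would make the folding uneven."""
    span = hi - lo + 1
    limit = 256 - 256 % span  # largest multiple of span that is <= 256
    if byte >= limit:
        return None           # caller would fetch another byte instead
    return lo + byte % span

def output_counts(mapper, lo, hi):
    """Run every possible byte through the mapper and tally the results."""
    results = (mapper(b, lo, hi) for b in range(256))
    return Counter(r for r in results if r is not None)

# Perfectly even input (each byte exactly once), mapped to a die roll.
naive = output_counts(naive_map, 1, 6)
fair = output_counts(rejection_map, 1, 6)

print("modulo:   ", sorted(naive.items()))
print("rejection:", sorted(fair.items()))
```

Because 256 is not a multiple of 6, the modulo mapping cannot give all six faces the same count, whereas the rejection version does. A check like this says nothing about the quality of the source data, only about whether the transformation preserves an even distribution.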

So, can anyone point me at any appropriate modules? Or at an algorithm that I could turn into a module?