daverave has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

The following piece of code randomly maps a set of ranges onto a circumference of a circle. In the example, the circumference is of length 1000 and legal ranges are e.g. (0,8)=0,1,2,...,8 and (995,2)=995,996,...,999,0,1,2 (i.e. zero-based coordinates; both start and end are inclusive).

I take some arbitrary position on the circumference (e.g. 36) and count how many ranges cover it in each simulation.

finally, I calculate the mean and variance of this statistic.

use strict; use warnings; use Statistics::Descriptive; my $n_simulations = 1000; my $circumference = 1000; my @lengths_distrib = (100) x 100; # distibution of range lengt +hs my $some_pos = 36; # arbitrary position my $stat = Statistics::Descriptive::Full->new(); foreach my $sim ( 1 .. $n_simulations ) { # randomly map ranges onto circumference my @random_ranges = map { my $start = int( rand($circumference) ); [ $start, ( + $start + $_ -1 ) % $circumference ] } @lengths_distrib; # count how many range contain $some_pos my $num_covering_ranges = scalar( grep { ( $_->[0] <= $some_pos and $_->[1] >= $some_pos ) o +r ( $_->[1] < $_->[0] and $_->[1] > $some_pos ) } @random_ranges ); $stat->add_data($num_covering_ranges); } print $stat->mean, ' ', $stat->variance, "\n";

To the best of my knowledge, this kind of random variable should follow Poisson distribution (law of rare events and so on). Hence, the mean and variance should be equal. However, the variance seems to systematically be a bit lower than the mean.

What am I missing?

UPDATE: this question is now also shared with the guys at CrossValidated (http://stats.stackexchange.com/questions/5005/).

Replies are listed 'Best First'.
Re: Unexpected under-dispersion in random simulations
by BrowserUk (Patriarch) on Nov 29, 2010 at 22:18 UTC

    I think that your range testing is suspect.

    1. The test for the inverted range should at least be: ( $_->[1] < $_->[0] and $_->[1] >= $some_pos ) }

      That is, the test should be that the end is less than or equal to $some_pos

    2. Also, whilst the test works for your current arbitrary chosen position, 36, it would fail when that arbitrary position is greater than 900.

      Ie. If $some_pos = 950; then a range of [0, 901] would fail:

      $some_pos = 950; print +( $_->[1] < $_->[0] and $_->[1] >= $some_pos ) ? 'contained' : 'not contained' for [0, 901];; not contained

    However, neither of those affect the discrepancy you note.

    I also thought that using a better rand() might change things, but as you note, Math::Random::MT shows the same discrepancy.

    Then I thought perhaps the difference might be due to sample .v. population variance; and discovered that there appears to be an undocumented parameter to the variance() method that might be intended to affect the calculation:

    sub variance { my $self = shift; ##Myself my $div = @_ ? 0 : 1; ... $variance /= $count - $div;

    Ie. If you pass any parameter(s) to the variance method, it appears (I think) to calculate the population rather than the sample variance? Update: the variations are the biased versus unbiased estimates of variance.

    But as expected, whilst it does make some slight difference, it is very small:

    [21:59:37.22] C:\test>874353 10000: 10.003 9.138 10.0028 9.13839215999999 [21:59:44.34] C:\test>874353 10000: 10.003 9.139 10.0028 9.13930609060905

    At this point, I see two possibilities that I don't have answers to, or the knowledge to verify:

    • Your understanding that the mean and variance should be the same might be wrong.

      I don't recall ever seeing it stated that variance and mean should coincide for such data?

    • Your understanding is correct, but that it is not applicable to the particular mechanics of your test.

      Could the calculation of the variance be affected by a) the inclusive nature of the ranges; or b) the wrap-around?

    Not much help, but it might trigger some thoughts somewhere.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      While I accept you first correction (>= instead of >), I don't understand the second point. 950 is really not contained in 0,901.

        Sorry. I swapped the ends around when posting. That should [901,0], but the point still stands:

        $some_pos = 950; print +( $_->[1] < $_->[0] and $_->[1] >= $some_pos ) ? 'contained' : 'not contained' for [901,0];; not contained

        A range of 100 starting at 901 will wrap around to end at position 0, thereby covering position 950; but your test fails to detect it.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Unexpected under-dispersion in random simulations
by blakew (Monk) on Nov 29, 2010 at 23:43 UTC
    Looking closer at the law of rare events it seems that possibly you're setting the window too large and/or the circumferance too small to see the limit convergence to the Poisson distribution. Your description refers to an 9-size window but the code uses 10% of the circumference (100). Fixing the range test per BrowserUK's post and setting the window to 9 seems to pull up similar numbers for mean and variance.
Re: Unexpected under-dispersion in random simulations
by Illuminatus (Curate) on Nov 29, 2010 at 21:00 UTC
    I believe the built-in rand perl function is based on the underlying pseudo-random generator. I would suggest using Math::Random to generate better values

    fnord

      I have tried using Math::Random::MT::Auto but the results were similar.

        In that case the fault is in your code unless Math::Random::MT::Auto is a bad implementation. You could try Math::Random::MT instead which, on a cursory inspection, should pretty much be a drop in replacement. The Mersenne Twister is an excellent PNG for most purposes.

        True laziness is hard work