BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Given the following set of data--value picked;frequency picked;percentage of total--is there anything that can meaningfully be derived from it about the picking process?

c:\test>test.pl
 10 :  39663 (3.966%) ####
 20 :  41281 (4.128%) ####
 30 :  43552 (4.355%) ####
 40 :  46839 (4.684%) ####
 50 :  50217 (5.022%) #####
 60 :  53097 (5.310%) #####
 70 :  57457 (5.746%) #####
 80 :  61963 (6.196%) ######
 90 :  68065 (6.806%) ######
100 :  74738 (7.474%) #######
110 :  68005 (6.801%) ######
120 :  62216 (6.222%) ######
130 :  57352 (5.735%) #####
140 :  53747 (5.375%) #####
150 :  49963 (4.996%) ####
160 :  46435 (4.644%) ####
170 :  44099 (4.410%) ####
180 :  41758 (4.176%) ####
190 :  39553 (3.955%) ####

It obviously isn't a straight random pick; otherwise the distribution would be more even. It's also not quite a classic bell curve.

The process that produces this distribution is random (i.e. it uses rand()), but it's more complicated than just $pick = $n[ rand @n ];.

The thing I'm trying to resolve is: why is it more complicated? There is a bias towards picking the median value; but is that bias significant?


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re: [OT]: Statistical significance?
by jjap (Monk) on Dec 21, 2010 at 04:14 UTC
    If you're looking to find out whether this distribution is likely a uniform distribution, this looks like a job for the chi-square test.
    Dealing with the absolute counts (updated from "percentages" to match what I posted):
    The statistic is the sum of ((observed - expected)^2)/expected over the classes, referred to a chi-square distribution with df degrees of freedom (a small Perl sketch of the calculation is at the end of this node).
    Since you have 19 classes, df = 18, and the test gives: X-squared = 38297.18, df = 18, p-value < 2.2e-16
    And that very small p-value indicates it does not depart significantly from a uniform distribution...
    Update: I originally got it backwards, as Anonymous Monk pointed out. Hence the following also became a moot point.
    However, if the same hump were to be observed over and over again, then my suggested approach was probably not the right one to follow; a real statistician might have better insight. (That last part still holds ;-)
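    For anyone wanting to redo the arithmetic in Perl, here is a minimal sketch of the calculation against a uniform expectation (it uses the counts from the posted histogram; the 28.87 in the comment is the standard 5% critical value for df = 18):

    #! perl -slw
    use strict;
    use List::Util qw[ sum ];

    # Observed counts, taken from the posted histogram (19 classes).
    my @observed = qw[
        39663 41281 43552 46839 50217 53097 57457 61963 68065 74738
        68005 62216 57352 53747 49963 46435 44099 41758 39553
    ];

    my $total    = sum @observed;
    my $expected = $total / @observed;    # uniform expectation per class

    my $chisq = sum map { ( $_ - $expected )**2 / $expected } @observed;
    my $df    = @observed - 1;            # 19 classes => 18 degrees of freedom

    printf "X-squared = %.2f, df = %d\n", $chisq, $df;
    # Anything much larger than the 5% critical value for df = 18 (~28.87)
    # is a significant departure from uniformity.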
      that very small p-value indicates it does not depart significantly from a uniform distribution

      I think you got the interpretation wrong. Normally, one would reject the null hypothesis (=uniform distribution here) if the p-value is less than 0.05 or 0.01, corresponding to 5% or 1% significance level (i.e. the error probability of incorrectly rejecting the null hypothesis despite it being true). In other words, the deviations from a uniform distribution are statistically highly significant.
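      For scale: the chi-square critical value at 18 degrees of freedom is roughly 28.9 at the 5% level and 34.8 at the 1% level, so a statistic of ~38297 exceeds either threshold by a factor of more than a thousand -- the departure from uniformity is about as clear-cut as it gets.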

      Many thanks to you (and AnonyRNous monk, whoever he may be) for prompting me in the direction of chi-squared, the null hypothesis, and all that jazz.

      After reading lots, much of which probably went over my head, I settled for an empirical study comparing the results of 1e6 runs using a) a uniform pick (a simple $data[ rand @data ]); and b) the slightly non-uniform pick that the datapoints I posted represent. The upshot was that I could discern no difference in the final results of the two algorithms.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: [OT]: Statistical significance?
by Anonymous Monk on Dec 22, 2010 at 09:00 UTC
    Picking things at random doesn't have to mean they fit a uniform PDF. Is there any reason to think this distribution only makes sense over a finite range of values? Is this distribution actually the sum of a uniform distribution and some kind of peaked distribution? Neither tail of your data is tending to 0 quickly; perhaps there is an underlying uniform distribution providing ~39k counts per channel? Over a finite range, a sum of a uniform and a beta distribution might look like this. Is a resonance phenomenon at work? A Cauchy distribution (one name for it, and one which happens to have a Perl module) has a mode and a median, but is so heavily tailed that it has no defined central moments.
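    To make the "sum of a uniform and a peaked distribution" idea concrete, here is a rough sketch; the 0.6 mixing weight and the triangular weighting are pure assumptions, chosen only to produce a qualitatively similar hump over a flat floor, not anything derived from your code:

    #! perl -slw
    use strict;
    use List::Util qw[ sum ];

    my @classes   = map $_ * 10, 1 .. 19;
    my $p_uniform = 0.6;                   # assumed mixing weight

    # Assumed peaked component: simple triangular weights centred on 100.
    my @weights = map { 110 - abs( 100 - $_ ) } @classes;
    my $wtotal  = sum @weights;

    my %counts;
    for ( 1 .. 1e6 ) {
        my $idx;
        if( rand() < $p_uniform ) {        # uniform "floor"
            $idx = int rand @classes;
        }
        else {                             # peaked component (roulette wheel)
            my $point = rand $wtotal;
            my $run   = 0;
            for my $i ( 0 .. $#weights ) {
                if( ( $run += $weights[ $i ] ) > $point ) { $idx = $i; last }
            }
        }
        ++$counts{ $classes[ $idx ] };
    }

    printf "%3d : %6d (%.3f%%)\n", $_, $counts{ $_ }, $counts{ $_ } * 100 / 1e6
        for sort { $a <=> $b } keys %counts;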

      The simple answer to your questions is I don't know.

      This is the code that produced the reference data:

      #! perl -slw
      use strict;

      use Data::Dump qw[ pp ];
      use List::Util qw[ shuffle sum ];

      use constant TARGET => 100;

      # Roulette-wheel selection: walk the scores until the running total
      # exceeds the (random) pick point, and return that index.
      sub pick {
          my( $scoresRef, $pickPoint ) = @_;
          my $n = 0;
          ( $n += $scoresRef->[ $_ ] ) > $pickPoint and return $_
              for 0 .. $#{ $scoresRef };
      }

      # Score a result by its closeness to TARGET: 3 for an exact match,
      # decreasing with distance (about 1.58 at the extremes).
      sub score {
          my( $result ) = @_;
          return TARGET / ( TARGET + abs( TARGET - $result ) ) * 3;
      }

      my %stats;

      for ( 1 .. 1e3 ) {
          my @results = shuffle map $_*10, 1 .. 19;
          my @scores  = map score( $_ ), @results;
          my $total   = sum @scores;

          for ( 1 .. 1000 ) {
              my $picked = pick( \@scores, rand( $total ) );
              ++$stats{ $results[ $picked ] };
          }
      }

      my $total = sum values %stats;

      for my $key ( sort { $a <=> $b } keys %stats ) {
          printf "%3d : %6d (%.3f%%)\n",
              $key, $stats{ $key }, $stats{ $key } * 100 / $total;
      }

      __END__
      c:\test>test.pl
       10 :  39702 (3.970%)
       20 :  41626 (4.163%)
       30 :  44526 (4.453%)
       40 :  47013 (4.701%)
       50 :  49309 (4.931%)
       60 :  53178 (5.318%)
       70 :  57689 (5.769%)
       80 :  62036 (6.204%)
       90 :  67798 (6.780%)
      100 :  74497 (7.450%)
      110 :  67891 (6.789%)
      120 :  62295 (6.229%)
      130 :  57471 (5.747%)
      140 :  53166 (5.317%)
      150 :  49932 (4.993%)
      160 :  47001 (4.700%)
      170 :  43562 (4.356%)
      180 :  41775 (4.178%)
      190 :  39533 (3.953%)

      My particular interest was in trying to understand how the scoring function interacted with the picking function to influence which values were chosen from the @results array.

      The output shows that, statistically, values closer to the TARGET value will be picked very slightly more often than those further away. But the bias appears to me to be so slight that, over a small number of picks--usually a few tens or low hundreds--the effect of that bias is almost negligible. Even exact matches have only a slightly greater chance of being picked than values far away.
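      To put a number on "slight": with TARGET = 100 the scoring function gives score(100) = 3, while the extremes get score(10) = score(190) = 3 * 100/190, roughly 1.58. So, within any one shuffled pool, an exact match carries only about 1.9 times the weight of the farthest value -- which is consistent with the ~7.5% versus ~4.0% split at the two ends of the histogram above.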

      My thought is that, as the picking process--relative to the rest of the processing--is fairly computationally expensive, it would make more sense to use a straight random pick, which is much cheaper. Or, if biasing the pick in favour of close-to-target values actually benefits the rest of the (GA) algorithm, it would be better to make the bias stronger, or the computation cheaper, or both.
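      On the "make the computation cheaper" option, one possibility (a sketch only, not the original author's approach) is to build the cumulative score sums once per shuffle and binary-search them for each pick -- O(log n) per pick instead of pick()'s linear scan. Marginal with only 19 values, but it preserves exactly the same bias:

      # Hypothetical replacement for pick(): returns a closure that
      # binary-searches precomputed cumulative scores.
      sub make_picker {
          my( $scoresRef ) = @_;
          my @cum;
          my $running = 0;
          push @cum, $running += $_ for @$scoresRef;
          my $total = $cum[ -1 ];

          return sub {
              my $pickPoint = rand $total;
              my( $lo, $hi ) = ( 0, $#cum );
              while( $lo < $hi ) {
                  my $mid = int( ( $lo + $hi ) / 2 );
                  if( $cum[ $mid ] > $pickPoint ) { $hi = $mid }
                  else                            { $lo = $mid + 1 }
              }
              return $lo;   # first index whose cumulative score exceeds the pick point
          };
      }

      # Usage, mirroring the inner loop of the code above:
      # my $picker = make_picker( \@scores );
      # ++$stats{ $results[ $picker->() ] } for 1 .. 1000;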

      As an example of making the bias stronger, using this scoring function:

      sub score {
          my( $result ) = @_;
          return 1 / abs( ( TARGET - $result ) || 1 );
      }

      Produces this PDF:

      c:\test>867119-test.pl
       10 :   7211 (0.721%)
       20 :   8070 (0.807%)
       30 :   8996 (0.900%)
       40 :  10694 (1.069%)
       50 :  12714 (1.271%)
       60 :  16248 (1.625%)
       70 :  21447 (2.145%)
       80 :  32136 (3.214%)
       90 :  63858 (6.386%)
      100 : 637874 (63.787%)
      110 :  63625 (6.362%)
      120 :  32076 (3.208%)
      130 :  21244 (2.124%)
      140 :  15976 (1.598%)
      150 :  12790 (1.279%)
      160 :  10716 (1.072%)
      170 :   9298 (0.930%)
      180 :   7958 (0.796%)
      190 :   7069 (0.707%)
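      For what it's worth, that spike falls straight out of the arithmetic: with this scoring function the exact match scores 1/1 = 1 while every other value scores 1/|TARGET - result| (0.1 at distance 10, down to about 0.011 at distance 90). The 19 scores sum to roughly 1.566, so the exact match gets about 1/1.566, or 64%, of the picks, and its nearest neighbours about 6.4% each -- which is what the run above shows.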

      But in my tests that doesn't seem to cause the GA to converge on the TARGET any more quickly than a uniformly random pick does. That left me wondering why the original author chose the scoring function he did, and I hoped one of the more stats-wise monks might see something in the distribution that would hint at the reasoning.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.