The simple answer to your questions is: I don't know.
This is the code that produced the reference data:
#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use List::Util qw[ shuffle sum ];

use constant TARGET => 100;

## Roulette-wheel pick: walk the scores accumulating a running total,
## and return the index at which it first exceeds the pick point.
sub pick {
    my( $scoresRef, $pickPoint ) = @_;
    my $n = 0;
    ( $n += $scoresRef->[ $_ ] ) > $pickPoint and return $_
        for 0 .. $#{ $scoresRef };
}

## Score peaks at 3 for an exact match and falls away either side.
sub score {
    my( $result ) = @_;
    return TARGET / ( TARGET + abs( TARGET - $result ) ) * 3;
}

my %stats;
for ( 1 .. 1e3 ) {
    my @results = shuffle map $_ * 10, 1 .. 19;
    my @scores  = map score( $_ ), @results;
    my $total   = sum @scores;
    for ( 1 .. 1000 ) {
        my $picked = pick( \@scores, rand( $total ) );
        ++$stats{ $results[ $picked ] };
    }
}

my $total = sum values %stats;
for my $key ( sort { $a <=> $b } keys %stats ) {
    printf "%3d : %6d (%.3f%%)\n",
        $key, $stats{ $key }, $stats{ $key } * 100 / $total;
}
__END__
c:\test>test.pl
 10 :  39702 (3.970%)
 20 :  41626 (4.163%)
 30 :  44526 (4.453%)
 40 :  47013 (4.701%)
 50 :  49309 (4.931%)
 60 :  53178 (5.318%)
 70 :  57689 (5.769%)
 80 :  62036 (6.204%)
 90 :  67798 (6.780%)
100 :  74497 (7.450%)
110 :  67891 (6.789%)
120 :  62295 (6.229%)
130 :  57471 (5.747%)
140 :  53166 (5.317%)
150 :  49932 (4.993%)
160 :  47001 (4.700%)
170 :  43562 (4.356%)
180 :  41775 (4.178%)
190 :  39533 (3.953%)
My particular interest was in trying to understand how the scoring function interacted with the picking function to influence which values were chosen from the @results array.
The output shows that, statistically, values closer to the TARGET value are picked very slightly more often than those further away. But the bias appears to me to be so slight that over a short run of picks--usually a few tens or low hundreds--its effect is almost negligible. Even exact matches have only a slightly greater chance of being picked than values far away.
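For illustration: since pick() is standard roulette-wheel selection, each value's expected share of the picks is just its score divided by the sum of all the scores. A minimal sketch (reusing the score() above) that computes those shares directly rather than by simulation:

#! perl -slw
use strict;
use List::Util qw[ sum ];

use constant TARGET => 100;

sub score {
    my( $result ) = @_;
    return TARGET / ( TARGET + abs( TARGET - $result ) ) * 3;
}

## Expected pick probability for each value: score / total score.
my @results = map $_ * 10, 1 .. 19;
my $total   = sum map score( $_ ), @results;
printf "%3d : %.3f%%\n", $_, score( $_ ) * 100 / $total for @results;

That gives roughly 3.94% for 10 and 7.48% for 100, agreeing with the simulated figures above. The exact match scores 3 against roughly 1.58 for the values furthest away, so the bias is less than 2:1 end to end.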
My thought is that, as the picking process is fairly computationally expensive relative to the rest of the processing, it would make more sense either to use a straight random pick, which is much cheaper; or, if biasing the pick in favour of close-to-target values actually benefits the rest of the (GA) algorithm, to make the bias stronger, or the computation cheaper, or both.
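For comparison, the uniform pick is a one-liner, with no scoring pass and no cumulative scan (a minimal sketch, assuming the same @results array):

my @results = map $_ * 10, 1 .. 19;
my $picked  = $results[ rand @results ];    ## array index truncates the fraction

And if the bias does turn out to be worth keeping, the cumulative score totals could be precomputed once and searched with a binary search, cutting each biased pick from O(n) to O(log n).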
For example, this scoring function:
sub score {
    my( $result ) = @_;
    return 1 / abs( ( TARGET - $result ) || 1 );
}
produces this PDF:
c:\test>867119-test.pl
 10 :   7211 (0.721%)
 20 :   8070 (0.807%)
 30 :   8996 (0.900%)
 40 :  10694 (1.069%)
 50 :  12714 (1.271%)
 60 :  16248 (1.625%)
 70 :  21447 (2.145%)
 80 :  32136 (3.214%)
 90 :  63858 (6.386%)
100 : 637874 (63.787%)
110 :  63625 (6.362%)
120 :  32076 (3.208%)
130 :  21244 (2.124%)
140 :  15976 (1.598%)
150 :  12790 (1.279%)
160 :  10716 (1.072%)
170 :   9298 (0.930%)
180 :   7958 (0.796%)
190 :   7069 (0.707%)
But in my tests, that doesn't seem to make the GA converge on the TARGET any more quickly than a uniformly random pick does. That left me wondering why the original author chose the scoring function he did, and I hoped one of the more stats-wise monks might see something in the distribution that would hint at the reasoning.