The simple answer to your questions is: I don't know.
This is the code that produced the reference data:
#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use List::Util qw[ shuffle sum ];

use constant TARGET => 100;

## Roulette-wheel pick: walk the scores accumulating a running total,
## and return the index at which it first exceeds the pick point.
sub pick {
    my( $scoresRef, $pickPoint ) = @_;
    my $n = 0;
    ( $n += $scoresRef->[ $_ ] ) > $pickPoint and return $_
        for 0 .. $#{ $scoresRef };
}

## Score peaks at 3 for an exact match and falls away either side.
sub score {
    my( $result ) = @_;
    return TARGET / ( TARGET + abs( TARGET - $result ) ) * 3;
}

my %stats;
for ( 1 .. 1e3 ) {
    my @results = shuffle map $_ * 10, 1 .. 19;
    my @scores  = map score( $_ ), @results;
    my $total   = sum @scores;
    for ( 1 .. 1000 ) {
        my $picked = pick( \@scores, rand( $total ) );
        ++$stats{ $results[ $picked ] };
    }
}

my $total = sum values %stats;
for my $key ( sort { $a <=> $b } keys %stats ) {
    printf "%3d : %6d (%.3f%%)\n",
        $key, $stats{ $key }, $stats{ $key } * 100 / $total;
}
__END__
c:\test>test.pl
 10 :  39702 (3.970%)
 20 :  41626 (4.163%)
 30 :  44526 (4.453%)
 40 :  47013 (4.701%)
 50 :  49309 (4.931%)
 60 :  53178 (5.318%)
 70 :  57689 (5.769%)
 80 :  62036 (6.204%)
 90 :  67798 (6.780%)
100 :  74497 (7.450%)
110 :  67891 (6.789%)
120 :  62295 (6.229%)
130 :  57471 (5.747%)
140 :  53166 (5.317%)
150 :  49932 (4.993%)
160 :  47001 (4.700%)
170 :  43562 (4.356%)
180 :  41775 (4.178%)
190 :  39533 (3.953%)
My particular interest was in trying to understand how the scoring function interacted with the picking function to influence which values were chosen from the @results array.
The output shows that, statistically, values closer to the TARGET value are picked very slightly more often than those further away. But the bias appears to me to be so slight that over a short run of picks--usually a few tens or low hundreds--its effect is almost negligible. Even exact matches have only a slightly greater chance of being picked than values far away.
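For illustration: since pick() is standard roulette-wheel selection, each value's expected share of the picks is just its score divided by the sum of all the scores. A minimal sketch (reusing the score() above) that computes those shares directly rather than by simulation:

#! perl -slw
use strict;
use List::Util qw[ sum ];

use constant TARGET => 100;

sub score {
    my( $result ) = @_;
    return TARGET / ( TARGET + abs( TARGET - $result ) ) * 3;
}

## Expected pick probability for each value: score / total score.
my @results = map $_ * 10, 1 .. 19;
my $total   = sum map score( $_ ), @results;
printf "%3d : %.3f%%\n", $_, score( $_ ) * 100 / $total for @results;

That gives roughly 3.94% for 10 and 7.48% for 100, agreeing with the simulated figures above. The exact match scores 3 against roughly 1.58 for the values furthest away, so the bias is less than 2:1 end to end.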
My thought is that, as the picking process is fairly computationally expensive relative to the rest of the processing, it would make more sense either to use a straight random pick, which is much cheaper; or, if biasing the pick in favour of close-to-target values actually benefits the rest of the (GA) algorithm, to make the bias stronger, or the computation cheaper, or both.
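For comparison, the uniform pick is a one-liner, with no scoring pass and no cumulative scan (a minimal sketch, assuming the same @results array):

my @results = map $_ * 10, 1 .. 19;
my $picked  = $results[ rand @results ];    ## array index truncates the fraction

And if the bias does turn out to be worth keeping, the cumulative score totals could be precomputed once and searched with a binary search, cutting each biased pick from O(n) to O(log n).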
For example, this scoring function:
sub score {
    my( $result ) = @_;
    return 1 / abs( ( TARGET - $result ) || 1 );
}
produces this PDF:
c:\test>867119-test.pl
 10 :   7211 (0.721%)
 20 :   8070 (0.807%)
 30 :   8996 (0.900%)
 40 :  10694 (1.069%)
 50 :  12714 (1.271%)
 60 :  16248 (1.625%)
 70 :  21447 (2.145%)
 80 :  32136 (3.214%)
 90 :  63858 (6.386%)
100 : 637874 (63.787%)
110 :  63625 (6.362%)
120 :  32076 (3.208%)
130 :  21244 (2.124%)
140 :  15976 (1.598%)
150 :  12790 (1.279%)
160 :  10716 (1.072%)
170 :   9298 (0.930%)
180 :   7958 (0.796%)
190 :   7069 (0.707%)
But in my tests, that doesn't seem to make the GA converge on the TARGET any more quickly than a uniformly random pick does. That left me wondering why the original author chose the scoring function he did, and I hoped one of the more stats-wise monks might see something in the distribution that would hint at the reasoning.