Apart from time benchmarks, a shuffle algorithm should also be assessed on the quality of the randomness of the shuffled array. One way to do this is to calculate the lag-1 auto-correlation of the shuffled sequence (i.e. the correlation between consecutive elements). The absolute value of the auto-correlation coefficient approaches 1 when the sequence is highly auto-correlated (for example the test array 1..1000) and approaches zero when it is not. So a good-quality shuffle should produce auto-correlations close to zero.

Edit: suggested test scenario: start with a highly correlated array (e.g. 1..1000, for which perl -MStatistics::Autocorrelation -e 'print Statistics::Autocorrelation->new()->coefficient(data=>[1..1000],lag=>1)."\n"' yields 0.997) and see how far the shuffling algorithm de-correlates it, i.e. lowers its auto-correlation coefficient towards zero.

Edit 2: the auto-correlation coefficient lies in the range -1 to 1. Both extremes indicate highly auto-correlated sequences; zero indicates no auto-correlation. In this test I take the absolute value of the coefficient.
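
For reference, here is a minimal hand-rolled sketch of the lag-1 coefficient, assuming the usual estimator (sum of products of consecutive deviations from the mean, divided by the sum of squared deviations). The sub name lag1_autocorrelation is made up for illustration, and the computation may differ in detail from what Statistics::Autocorrelation does internally, although for 1..1000 it gives the same 0.997 quoted above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(sum shuffle);

# Lag-1 autocorrelation (illustrative sketch, not the module's code):
#   r1 = sum_{i=0}^{n-2} (x[i]-m)*(x[i+1]-m) / sum_{i=0}^{n-1} (x[i]-m)^2
sub lag1_autocorrelation {
    my @x = @{ $_[0] };
    return 0 if @x < 2;              # too short to have consecutive pairs
    my $mean = sum(@x) / @x;
    my $den  = sum( map { ($_ - $mean) ** 2 } @x );
    return 0 if $den == 0;           # constant sequence: 0 by convention (assumption)
    my $num  = sum( map { ($x[$_] - $mean) * ($x[$_ + 1] - $mean) } 0 .. $#x - 1 );
    return $num / $den;
}

printf "ordered : %.3f\n", lag1_autocorrelation( [ 1 .. 1000 ] );
printf "shuffled: %.3f\n", abs( lag1_autocorrelation( [ shuffle 1 .. 1000 ] ) );
```

The second line varies from run to run, but should be far closer to zero than the first.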

The following script compares the three methods mentioned here (BrowserUK's, tybalt89's and List::Util's shuffle) with respect to auto-correlation. For each trial it also plots a histogram of the differences between consecutive elements of the shuffled array, just for fun.

The best shuffle is the one that produces the lowest mean auto-correlation with the lowest variance and the most successes (a success meaning it had the minimum auto-correlation of the three in a given trial).

./fisher_yates.pl : after 5000 trials shuffling arrays of size 1000:
List::Util::shuffle : 1693 successes, mean:0.0105896962736892, stdev:0.00900688731621982
BUK : 1685 successes, mean:0.010799062825769, stdev:0.00921403469412604
tybalt89 : 1622 successes, mean:0.0102906705829024, stdev:0.00843760632828801

once more:

./fisher_yates.pl : after 5000 trials shuffling arrays of size 1000:
BUK : 1696 successes, mean:0.0104235933728858, stdev:0.00897497055761236
List::Util::shuffle : 1690 successes, mean:0.0106133000677379, stdev:0.00908235156157047
tybalt89 : 1614 successes, mean:0.0100835174626996, stdev:0.00897955319759652

once more:

./fisher_yates.pl : after 5000 trials shuffling arrays of size 1000:
List::Util::shuffle : 1690 successes, mean:0.0104611128054915, stdev:0.00886345338184372
BUK : 1658 successes, mean:0.0102429744950854, stdev:0.00848038138137249
tybalt89 : 1652 successes, mean:0.0105683142305418, stdev:0.00899061563593633

My opinion: all algorithms work well with respect to randomness (as assessed by auto-correlation) and now we can move to time benchmarks.

TODO: try with a different random number generator (i.e. more reliably uniform).
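
One way to approach this TODO is a Fisher-Yates shuffle that takes the random source as a code reference, so that any generator returning a float in [0, 1) can be swapped in without touching the shuffle itself. The following is only a sketch under that assumption: the name fisher_yates_with is hypothetical, and the built-in rand is used here as a stand-in for a better generator.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# In-place Fisher-Yates shuffle with a pluggable random source.
# $rng is a code ref that must return a float in [0, 1);
# it defaults to Perl's built-in rand().
sub fisher_yates_with {
    my ($arr, $rng) = @_;
    $rng ||= sub { rand() };
    for my $i ( reverse 1 .. $#$arr ) {
        my $j = int( $rng->() * ( $i + 1 ) );   # 0 <= $j <= $i
        @$arr[ $i, $j ] = @$arr[ $j, $i ];      # swap via array slice
    }
    return $arr;
}

my @data = 1 .. 10;
fisher_yates_with( \@data, sub { rand() } );
print "@data\n";
```

Passing a deterministic code ref (for example one that always returns 0) makes the shuffle reproducible, which can help when comparing generators on the same input.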

The test program:

#!/usr/bin/env perl

use strict;
use warnings;

use Statistics::Histogram;
use Statistics::Autocorrelation;
use Statistics::Descriptive;
use List::Util qw/shuffle/;

my $N = 1000;
my $trials = 50; # the runs reported above used 5000
my %mins = ();
for(1..$trials){
        my $res = assess_once();
        if( exists $mins{$res->[0]} ){
                push(@{$mins{$res->[0]}}, $res->[1]->[1]);
        } else {
                $mins{$res->[0]} = [$res->[1]->[1]];
        }
}
print "$0 : after $trials trials shuffling arrays of size $N:\n";
foreach (keys %mins){
        my @re = @{$mins{$_}};
        my $stats = Statistics::Descriptive::Full->new();
        $stats->add_data(@re);
        print $_." : ".scalar(@re)." successes, mean:".$stats->mean().", stdev:".$stats->standard_deviation()."\n";
}

sub     assess_once {
        my %results = ();

        my @array = 1..$N;
        shuffleAry_1( \@array );
        $results{'BUK'} = [
                histo_of_differences(\@array),
                corello_abs(\@array)
        ];

        @array = List::Util::shuffle(1..$N);
        $results{'List::Util::shuffle'} = [
                histo_of_differences(\@array),
                corello_abs(\@array)
        ];

        @array = 1..$N;
        @array = @{shuffleAry_2( \@array )};
        $results{'tybalt89'} = [
                histo_of_differences(\@array),
                corello_abs(\@array)
        ];

        # ascending sort: despite the "_desc" name, the lowest autocorrelation comes first
        my @keys_sorted_autocor_desc =
                sort { $results{$a}->[1] <=> $results{$b}->[1] } keys %results;
        foreach (@keys_sorted_autocor_desc){
                my $hist = $results{$_}->[0];
                my $autocor = $results{$_}->[1];
                print $_.") Autocorrelation coefficient: ".$autocor."\n";
                print $_.") Histogram of the differences of consecutive elements:\n".$hist."\n";
                print "--------------------------\n\n\n";
        }
        foreach (@keys_sorted_autocor_desc){
                my $hist = $results{$_}->[0];
                my $autocor = $results{$_}->[1];
                print $_.") Autocorrelation coefficient: ".$autocor."\n";
        }
        print "assess() : minimum autocorrelation coeff is "
                .$results{$keys_sorted_autocor_desc[0]}->[1]
                ." for ".$keys_sorted_autocor_desc[0]
                ."\n";
        print "assess() : done\n";
        return [$keys_sorted_autocor_desc[0], $results{$keys_sorted_autocor_desc[0]}];
}
exit(0);

sub shuffleAry_2 {
        my $arr = $_[0];
        return [
                map $_->[0],
                sort { $a->[1] <=> $b->[1] }
                map [ $_, rand ], @{$arr}
        ]
}
sub shuffleAry_1 {
    die 'Need array reference' unless ref( $_[0] ) eq 'ARRAY';
    our( @aliased, $a, $b ); local( *aliased, $a, $b ) = $_[0];

        $a = $_ + rand @aliased - $_,
        $b = $aliased[ $_ ],
        $aliased[ $_ ] = $aliased[ $a ],
        $aliased[ $a ] = $b
                for 0 .. $#aliased;
        return;
}
sub     corello_abs {
        my $arr = $_[0];
        my $acorr = Statistics::Autocorrelation->new();
        return abs(
                $acorr->coefficient(
                        data => $arr,
                        lag=>1
                )
        )
}
sub     histo_of_differences {
        my $arr = $_[0];
        my $N = $#$arr;
        my @diffs = (0)x($N);
        for(1..$N){
                $diffs[$_-1] = abs($arr->[$_] - $arr->[$_-1]);
        }
        return Statistics::Histogram::get_histogram(\@diffs);
}

In reply to Re: Shuffling CODONS by bliako
in thread Shuffling CODONS by WouterVG