I decided to Benchmark the original routine I presented, along
with lucs code, and a modification to it. Using the shuffle method is
clearly the least efficient, under every test case. lucs method is really
slick, and I was able to boost the speed with a little enhancing. I wasn't
up to trying to code up tye's method. Below are the benchmarks for the
various methods.
The speed difference is consistently 2.9 times from fastest to slowest. I
don't know how the shuffle() routine works, but it doesn't seem to be
affected by the size of the data set. Since I do know how lucs method
works, I didn't expect it to be non-linear (and it's not).
An interesting note in the 'test_jcwren' vs. 'test_jcwren2' tests. The only
difference is assigning 'undef' instead of 0 to the hash element. Since
exists() knows that 'undef' is a legal value for a hash value, we can shave
a significant amount of time off of 100,000 interations. In fact, simply
assigning 'undef' unstead of 0 account for 8 CPU seconds. I guess because
the interpeter doesn't have to create a scalar and do the conversion, it's
a little faster.
I've also left some code in their if you want to check the output and distributions of the random numbers. I hacked tilly's histogram code a previous node. All 4 implementations appear to have the same distribution, and produce values across the requires span.
100 Lines, 30 Questions
Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu
+csjcwren, test_lucsjcwren2...
test_jcwren: 73 wallclock secs (43.26 usr + 0.08 sys = 43.34 CPU
+)
test_lucs: 30 wallclock secs (17.27 usr + 0.01 sys = 17.28 CPU
+)
test_lucsjcwren: 28 wallclock secs (16.32 usr + 0.00 sys = 16.32 CPU
+)
test_lucsjcwren2: 26 wallclock secs (15.61 usr + 0.01 sys = 15.62 CPU
+)
500 Lines, 150 Questions
Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu
+csjcwren, test_lucsjcwren2...
test_jcwren: 379 wallclock secs (217.85 usr + 0.15 sys = 218.00
+CPU)
test_lucs: 145 wallclock secs ( 82.29 usr + 0.19 sys = 82.48
+CPU)
test_lucsjcwren: 137 wallclock secs ( 79.63 usr + 0.14 sys = 79.77
+CPU)
test_lucsjcwren2: 128 wallclock secs ( 75.22 usr + 0.02 sys = 75.24
+CPU)
1000 Lines, 300 Questions
Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu
+csjcwren, test_lucsjcwren2...
test_jcwren: 754 wallclock secs (438.51 usr + 0.50 sys = 439.01
+CPU)
test_lucs: 291 wallclock secs (168.96 usr + 0.12 sys = 169.08
+CPU)
test_lucsjcwren: 277 wallclock secs (162.44 usr + 0.04 sys = 162.48
+CPU)
test_lucsjcwren2: 267 wallclock secs (154.43 usr + 0.11 sys = 154.54
+CPU)
#!/usr/local/bin/perl -w
use strict;
use Algorithm::Numerical::Shuffle qw(shuffle);
use List::Util qw(min max sum);
use POSIX qw(ceil floor);
use Benchmark;
my $Max_Questions = 0;
my $Max_Lines = 0;
{
my $results;
my %hash;
for (([100,30]), ([500,150]), ([1000,300]))
{
$Max_Lines = @$_[0];
$Max_Questions = @$_[1];
printf ("%d Lines, %d Questions\n", $Max_Lines, $Max_Questions);
timethese (100000, {
'test_jcwren' => sub { test_jcwren ()
+ },
'test_lucs' => sub { test_lucs ()
+ },
'test_lucsjcwren' => sub { test_lucsjcwren
+() },
'test_lucsjcwren2' => sub { test_lucsjcwren2
+ () },
}
);
print "\n";
}
=cut
#
# Be sure to set $Max_Lines and $Max_Questions at the top
#
for (0..1000)
{
$results = test_lucsjcwren ();
$hash {$_}++ foreach (@$results);
}
show_array ($results);
show_histogram (1, \%hash);
=cut
}
#
# Test cases
#
sub test_jcwren
{
return \@{[(shuffle (0..$Max_Lines-1)) [0..$Max_Questions-1]]};
}
sub test_lucs
{
my $m = $Max_Questions;
my $j = $Max_Lines - $m + 1;
my %sample;
while ($m-- > 0)
{
my $val = int ($j * rand (1));
$sample{ exists $sample{$val} ? $j - 1 : $val } = 0;
++$j;
}
return (\@{[keys %sample]});
}
sub test_lucsjcwren
{
my $m = $Max_Questions;
my $j = $Max_Lines - $m + 1;
my %sample;
while ($m--)
{
my $val = int (rand ($j++));
$sample {exists $sample{$val} ? $j - 2 : $val} = 0;
}
return (\@{[keys %sample]});
}
sub test_lucsjcwren2
{
my $m = $Max_Questions;
my $j = $Max_Lines - $m + 1;
my %sample;
while ($m--)
{
my $val = int (rand ($j++));
$sample {exists $sample{$val} ? $j - 2 : $val} = undef;
}
return (\@{[keys %sample]});
}
#
# Utilities
#
sub show_array
{
my $array = shift;
my @sorted = sort {$a <=> $b} @$array;
print"Element\t\tUnsorted\tSorted\n";
print"-------\t\t--------\t------\n";
for (my $z = 0; $z < $Max_Questions; $z++)
{
print" $z";
print"\t\t @$array[$z]\t\t";
print" $sorted[$z]\n";
}
}
sub show_histogram
{
my $bin_size = shift;
my $articles = shift;
my $width = 50;
my $max_count = max (values %$articles);
my $scale = ceil($max_count / $width);
print " Index Count\n";
print "------------- -------", "-" x 50, "\n";
my @bins = sort {$a <=> $b} keys %$articles;
foreach my $bin (min(@bins)..max(@bins))
{
my $count = $articles->{$bin} || 0;
my $extra = ($count % $scale) ? '.' : '';
my $start = $bin * $bin_size;
my $end = $start + $bin_size - 1;
printf "%4d .. %4d \[%4d\] %s$extra\n",
$start, $end, $count, '#' x floor ($count / $scale);
}
print "\n Scale: #=$scale\n\n" if $scale > 1;
}
--Chris
e-mail jcwren |