(jcwren) Re: Creating an array of unique numbers (Benchmarks!)

I decided to Benchmark the original routine I presented, along with lucs code, and a modification to it. Using the shuffle method is clearly the least efficient, under every test case. lucs method is really slick, and I was able to boost the speed with a little enhancing. I wasn't up to trying to code up tye's method. Below are the benchmarks for the various methods.

The speed difference is consistently 2.9 times from fastest to slowest. I don't know how the shuffle() routine works, but it doesn't seem to be affected by the size of the data set. Since I do know how lucs method works, I didn't expect it to be non-linear (and it's not).

An interesting note in the 'test_jcwren' vs. 'test_jcwren2' tests. The only difference is assigning 'undef' instead of 0 to the hash element. Since exists() knows that 'undef' is a legal value for a hash value, we can shave a significant amount of time off of 100,000 interations. In fact, simply assigning 'undef' unstead of 0 account for 8 CPU seconds. I guess because the interpeter doesn't have to create a scalar and do the conversion, it's a little faster.

I've also left some code in their if you want to check the output and distributions of the random numbers. I hacked tilly's histogram code a previous node. All 4 implementations appear to have the same distribution, and produce values across the requires span.

100 Lines, 30 Questions
Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu
+csjcwren, test_lucsjcwren2...
     test_jcwren: 73 wallclock secs (43.26 usr +  0.08 sys = 43.34 CPU
+)
       test_lucs: 30 wallclock secs (17.27 usr +  0.01 sys = 17.28 CPU
+)
 test_lucsjcwren: 28 wallclock secs (16.32 usr +  0.00 sys = 16.32 CPU
+)
test_lucsjcwren2: 26 wallclock secs (15.61 usr +  0.01 sys = 15.62 CPU
+)
 
500 Lines, 150 Questions
Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu
+csjcwren, test_lucsjcwren2...
     test_jcwren: 379 wallclock secs (217.85 usr +  0.15 sys = 218.00 
+CPU)
       test_lucs: 145 wallclock secs ( 82.29 usr +  0.19 sys =  82.48 
+CPU)
 test_lucsjcwren: 137 wallclock secs ( 79.63 usr +  0.14 sys =  79.77 
+CPU)
test_lucsjcwren2: 128 wallclock secs ( 75.22 usr +  0.02 sys =  75.24 
+CPU)
 
1000 Lines, 300 Questions
Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu
+csjcwren, test_lucsjcwren2...
     test_jcwren: 754 wallclock secs (438.51 usr +  0.50 sys = 439.01 
+CPU)
       test_lucs: 291 wallclock secs (168.96 usr +  0.12 sys = 169.08 
+CPU)
 test_lucsjcwren: 277 wallclock secs (162.44 usr +  0.04 sys = 162.48 
+CPU)
test_lucsjcwren2: 267 wallclock secs (154.43 usr +  0.11 sys = 154.54 
+CPU)
[download]

#!/usr/local/bin/perl -w

use strict;
use Algorithm::Numerical::Shuffle qw(shuffle);
use List::Util qw(min max sum);
use POSIX qw(ceil floor);
use Benchmark;

my $Max_Questions = 0;
my $Max_Lines = 0;

{
   my $results;
   my %hash;

   for (([100,30]), ([500,150]), ([1000,300]))
   {
      $Max_Lines = @$_[0];
      $Max_Questions = @$_[1];

      printf ("%d Lines, %d Questions\n", $Max_Lines, $Max_Questions);

      timethese (100000, {
                          'test_jcwren'      => sub { test_jcwren ()  
+     },
                          'test_lucs'        => sub { test_lucs ()    
+     },
                          'test_lucsjcwren'  => sub { test_lucsjcwren 
+()   },
                          'test_lucsjcwren2' => sub { test_lucsjcwren2
+ ()  },
                         }
                );

      print "\n";
   }

=cut
   #
   #  Be sure to set $Max_Lines and $Max_Questions at the top
   #
   for (0..1000)
   {
      $results = test_lucsjcwren ();
      $hash {$_}++ foreach (@$results);
   }

   show_array ($results);
   show_histogram (1, \%hash);
=cut
}

#
#  Test cases
#
sub test_jcwren
{
   return \@{[(shuffle (0..$Max_Lines-1)) [0..$Max_Questions-1]]};
}

sub test_lucs
{
   my $m = $Max_Questions;
   my $j = $Max_Lines - $m + 1;
   my %sample;

   while ($m-- > 0)
   {
      my $val = int ($j * rand (1));
      $sample{ exists $sample{$val} ? $j - 1 : $val } = 0;
      ++$j;
   }

   return (\@{[keys %sample]});
}

sub test_lucsjcwren
{
   my $m = $Max_Questions;
   my $j = $Max_Lines - $m + 1;
   my %sample;

   while ($m--)
   {
      my $val = int (rand ($j++));

      $sample {exists $sample{$val} ? $j - 2 : $val} = 0;
   }

   return (\@{[keys %sample]});
}

sub test_lucsjcwren2
{
   my $m = $Max_Questions;
   my $j = $Max_Lines - $m + 1;
   my %sample;

   while ($m--)
   {
      my $val = int (rand ($j++));

      $sample {exists $sample{$val} ? $j - 2 : $val} = undef;
   }

   return (\@{[keys %sample]});
}

#
#  Utilities
#
sub show_array
{
   my $array = shift;

   my @sorted = sort {$a <=> $b} @$array;

   print"Element\t\tUnsorted\tSorted\n";
   print"-------\t\t--------\t------\n";

   for (my $z = 0; $z < $Max_Questions; $z++)
   {
      print"   $z";
      print"\t\t    @$array[$z]\t\t";
      print"   $sorted[$z]\n";
   }
}

sub show_histogram
{
   my $bin_size = shift;
   my $articles = shift;

   my $width = 50;
   my $max_count = max (values %$articles);
   my $scale = ceil($max_count / $width);

   print "  Index        Count\n";
   print "------------- -------", "-" x 50, "\n";

   my @bins = sort {$a <=> $b} keys %$articles;

   foreach my $bin (min(@bins)..max(@bins))
   {
      my $count = $articles->{$bin} || 0;
      my $extra = ($count % $scale) ? '.' : '';
      my $start = $bin * $bin_size;
      my $end = $start + $bin_size - 1;
      printf "%4d .. %4d  \[%4d\] %s$extra\n",
      $start, $end, $count, '#' x floor ($count / $scale);
   }

   print "\n  Scale: #=$scale\n\n" if $scale > 1;
}
[download]

--Chris

e-mail jcwren

Comment on (jcwren) Re: Creating an array of unique numbers (Benchmarks!) Select or Download Code

Replies are listed 'Best First'.
Re: (jcwren) Re: Creating an array of unique numbers (Benchmarks!) by archon (Monk) on Mar 28, 2001 at 02:14 UTC
You made me curious, so I tested the algorithm I proposed and came up with these results: 100 Lines, 30 Questions Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 6 wallclock secs ( 5.40 usr + 0.00 sys = 5.40 CPU) test_jcwren: 19 wallclock secs (17.80 usr + 0.02 sys = 17.82 CPU) test_lucs: 7 wallclock secs ( 6.30 usr + 0.00 sys = 6.30 CPU) test_lucsjcwren: 6 wallclock secs ( 5.84 usr + 0.00 sys = 5.84 CPU) test_lucsjcwren2: 6 wallclock secs ( 5.55 usr + 0.01 sys = 5.56 CPU +) 500 Lines, 150 Questions Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 28 wallclock secs (26.94 usr + 0.01 sys = 26.95 CPU) test_jcwren: 93 wallclock secs (88.83 usr + 0.11 sys = 88.94 CPU) test_lucs: 32 wallclock secs (30.66 usr + 0.00 sys = 30.66 CPU) test_lucsjcwren: 30 wallclock secs (28.46 usr + 0.02 sys = 28.48 CPU) test_lucsjcwren2: 29 wallclock secs (26.97 usr + 0.03 sys = 27.00 CPU +) 1000 Lines, 300 Questions Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 58 wallclock secs (54.92 usr + 0.05 sys = 54.97 CPU) test_jcwren: 189 wallclock secs (179.66 usr + 0.13 sys = 179.79 CPU) test_lucs: 66 wallclock secs (63.03 usr + 0.05 sys = 63.09 CPU) test_lucsjcwren: 61 wallclock secs (58.53 usr + 0.05 sys = 58.58 CPU) test_lucsjcwren2: 57 wallclock secs (54.80 usr + 0.05 sys = 54.85 CPU +) [download] test_archon: `sub test_archon { my %numhash; for (1 .. $Max_Questions) { my $randnum = int(rand($Max_Lines) + 1); redo if exists ($numhash{$randnum}); $numhash{$randnum} = 1; } return [keys (%numhash)]; }` [download] Edit 2001-03-26 by tye to remove `<pre>`	[reply] [d/l] [select]
(tye)Re2: Creating an array of unique numbers (Benchmarks!) by tye (Sage) on Mar 28, 2001 at 02:38 UTC
You need to try your algorythm with a large list when you want all (or even nearly all) of the question numbers! (: - tye (but my friends call me "Tye")	[reply]
Re: (tye)Re2: Creating an array of unique numbers (Benchmarks!) by archon (Monk) on Mar 28, 2001 at 03:38 UTC
I expected much worse results, but here are the results of grabbing 900 unique numbers from a list of 1000: `Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 259 wallclock secs (242.45 usr + 0.14 sys = 242.59 CPU) test_jcwren: 213 wallclock secs (202.54 usr + 0.16 sys = 202.70 CPU) test_lucs: 285 wallclock secs (267.02 usr + 0.27 sys = 267.28 CPU) test_lucsjcwren: 268 wallclock secs (248.88 usr + 0.24 sys = 249.12 C +PU) test_lucsjcwren2: 255 wallclock secs (231.91 usr + 0.21 sys = 232.12 +CPU)` [download] This is where jcwren's algorithm begins to outperform lucs and mine. I'm surprised mine is still about as fast as lucs', though.	[reply] [d/l]
Re (tilly) 5: Creating an array of unique numbers (Benchmarks!) by tilly (Archbishop) on Mar 28, 2001 at 09:30 UTC