in reply to Creating an array of unique numbers

I decided to Benchmark the original routine I presented, along with lucs code, and a modification to it. Using the shuffle method is clearly the least efficient, under every test case. lucs method is really slick, and I was able to boost the speed with a little enhancing. I wasn't up to trying to code up tye's method. Below are the benchmarks for the various methods.

The speed difference is consistently 2.9 times from fastest to slowest. I don't know how the shuffle() routine works, but it doesn't seem to be affected by the size of the data set. Since I do know how lucs method works, I didn't expect it to be non-linear (and it's not).

An interesting note in the 'test_jcwren' vs. 'test_jcwren2' tests. The only difference is assigning 'undef' instead of 0 to the hash element. Since exists() knows that 'undef' is a legal value for a hash value, we can shave a significant amount of time off of 100,000 interations. In fact, simply assigning 'undef' unstead of 0 account for 8 CPU seconds. I guess because the interpeter doesn't have to create a scalar and do the conversion, it's a little faster.

I've also left some code in their if you want to check the output and distributions of the random numbers. I hacked tilly's histogram code a previous node. All 4 implementations appear to have the same distribution, and produce values across the requires span.

100 Lines, 30 Questions Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu +csjcwren, test_lucsjcwren2... test_jcwren: 73 wallclock secs (43.26 usr + 0.08 sys = 43.34 CPU +) test_lucs: 30 wallclock secs (17.27 usr + 0.01 sys = 17.28 CPU +) test_lucsjcwren: 28 wallclock secs (16.32 usr + 0.00 sys = 16.32 CPU +) test_lucsjcwren2: 26 wallclock secs (15.61 usr + 0.01 sys = 15.62 CPU +) 500 Lines, 150 Questions Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu +csjcwren, test_lucsjcwren2... test_jcwren: 379 wallclock secs (217.85 usr + 0.15 sys = 218.00 +CPU) test_lucs: 145 wallclock secs ( 82.29 usr + 0.19 sys = 82.48 +CPU) test_lucsjcwren: 137 wallclock secs ( 79.63 usr + 0.14 sys = 79.77 +CPU) test_lucsjcwren2: 128 wallclock secs ( 75.22 usr + 0.02 sys = 75.24 +CPU) 1000 Lines, 300 Questions Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu +csjcwren, test_lucsjcwren2... test_jcwren: 754 wallclock secs (438.51 usr + 0.50 sys = 439.01 +CPU) test_lucs: 291 wallclock secs (168.96 usr + 0.12 sys = 169.08 +CPU) test_lucsjcwren: 277 wallclock secs (162.44 usr + 0.04 sys = 162.48 +CPU) test_lucsjcwren2: 267 wallclock secs (154.43 usr + 0.11 sys = 154.54 +CPU)
#!/usr/local/bin/perl -w use strict; use Algorithm::Numerical::Shuffle qw(shuffle); use List::Util qw(min max sum); use POSIX qw(ceil floor); use Benchmark; my $Max_Questions = 0; my $Max_Lines = 0; { my $results; my %hash; for (([100,30]), ([500,150]), ([1000,300])) { $Max_Lines = @$_[0]; $Max_Questions = @$_[1]; printf ("%d Lines, %d Questions\n", $Max_Lines, $Max_Questions); timethese (100000, { 'test_jcwren' => sub { test_jcwren () + }, 'test_lucs' => sub { test_lucs () + }, 'test_lucsjcwren' => sub { test_lucsjcwren +() }, 'test_lucsjcwren2' => sub { test_lucsjcwren2 + () }, } ); print "\n"; } =cut # # Be sure to set $Max_Lines and $Max_Questions at the top # for (0..1000) { $results = test_lucsjcwren (); $hash {$_}++ foreach (@$results); } show_array ($results); show_histogram (1, \%hash); =cut } # # Test cases # sub test_jcwren { return \@{[(shuffle (0..$Max_Lines-1)) [0..$Max_Questions-1]]}; } sub test_lucs { my $m = $Max_Questions; my $j = $Max_Lines - $m + 1; my %sample; while ($m-- > 0) { my $val = int ($j * rand (1)); $sample{ exists $sample{$val} ? $j - 1 : $val } = 0; ++$j; } return (\@{[keys %sample]}); } sub test_lucsjcwren { my $m = $Max_Questions; my $j = $Max_Lines - $m + 1; my %sample; while ($m--) { my $val = int (rand ($j++)); $sample {exists $sample{$val} ? $j - 2 : $val} = 0; } return (\@{[keys %sample]}); } sub test_lucsjcwren2 { my $m = $Max_Questions; my $j = $Max_Lines - $m + 1; my %sample; while ($m--) { my $val = int (rand ($j++)); $sample {exists $sample{$val} ? $j - 2 : $val} = undef; } return (\@{[keys %sample]}); } # # Utilities # sub show_array { my $array = shift; my @sorted = sort {$a <=> $b} @$array; print"Element\t\tUnsorted\tSorted\n"; print"-------\t\t--------\t------\n"; for (my $z = 0; $z < $Max_Questions; $z++) { print" $z"; print"\t\t @$array[$z]\t\t"; print" $sorted[$z]\n"; } } sub show_histogram { my $bin_size = shift; my $articles = shift; my $width = 50; my $max_count = max (values %$articles); my $scale = ceil($max_count / $width); print " Index Count\n"; print "------------- -------", "-" x 50, "\n"; my @bins = sort {$a <=> $b} keys %$articles; foreach my $bin (min(@bins)..max(@bins)) { my $count = $articles->{$bin} || 0; my $extra = ($count % $scale) ? '.' : ''; my $start = $bin * $bin_size; my $end = $start + $bin_size - 1; printf "%4d .. %4d \[%4d\] %s$extra\n", $start, $end, $count, '#' x floor ($count / $scale); } print "\n Scale: #=$scale\n\n" if $scale > 1; }
--Chris

e-mail jcwren

Replies are listed 'Best First'.
Re: (jcwren) Re: Creating an array of unique numbers (Benchmarks!)
by archon (Monk) on Mar 28, 2001 at 02:14 UTC
    You made me curious, so I tested the algorithm I proposed and came up with these results:
    100 Lines, 30 Questions Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 6 wallclock secs ( 5.40 usr + 0.00 sys = 5.40 CPU) test_jcwren: 19 wallclock secs (17.80 usr + 0.02 sys = 17.82 CPU) test_lucs: 7 wallclock secs ( 6.30 usr + 0.00 sys = 6.30 CPU) test_lucsjcwren: 6 wallclock secs ( 5.84 usr + 0.00 sys = 5.84 CPU) test_lucsjcwren2: 6 wallclock secs ( 5.55 usr + 0.01 sys = 5.56 CPU +) 500 Lines, 150 Questions Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 28 wallclock secs (26.94 usr + 0.01 sys = 26.95 CPU) test_jcwren: 93 wallclock secs (88.83 usr + 0.11 sys = 88.94 CPU) test_lucs: 32 wallclock secs (30.66 usr + 0.00 sys = 30.66 CPU) test_lucsjcwren: 30 wallclock secs (28.46 usr + 0.02 sys = 28.48 CPU) test_lucsjcwren2: 29 wallclock secs (26.97 usr + 0.03 sys = 27.00 CPU +) 1000 Lines, 300 Questions Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 58 wallclock secs (54.92 usr + 0.05 sys = 54.97 CPU) test_jcwren: 189 wallclock secs (179.66 usr + 0.13 sys = 179.79 CPU) test_lucs: 66 wallclock secs (63.03 usr + 0.05 sys = 63.09 CPU) test_lucsjcwren: 61 wallclock secs (58.53 usr + 0.05 sys = 58.58 CPU) test_lucsjcwren2: 57 wallclock secs (54.80 usr + 0.05 sys = 54.85 CPU +)
    test_archon:
    sub test_archon { my %numhash; for (1 .. $Max_Questions) { my $randnum = int(rand($Max_Lines) + 1); redo if exists ($numhash{$randnum}); $numhash{$randnum} = 1; } return [keys (%numhash)]; }

    Edit 2001-03-26 by tye to remove <pre>

      You need to try your algorythm with a large list when you want all (or even nearly all) of the question numbers! (:

              - tye (but my friends call me "Tye")
        I expected much worse results, but here are the results of grabbing 900 unique numbers from a list of 1000:
        Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 259 wallclock secs (242.45 usr + 0.14 sys = 242.59 CPU) test_jcwren: 213 wallclock secs (202.54 usr + 0.16 sys = 202.70 CPU) test_lucs: 285 wallclock secs (267.02 usr + 0.27 sys = 267.28 CPU) test_lucsjcwren: 268 wallclock secs (248.88 usr + 0.24 sys = 249.12 C +PU) test_lucsjcwren2: 255 wallclock secs (231.91 usr + 0.21 sys = 232.12 +CPU)
        This is where jcwren's algorithm begins to outperform lucs and mine. I'm surprised mine is still about as fast as lucs', though.