Creating an array of unique numbers

TStanley has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
(jcwren) Re: Creating an array of unique numbers by jcwren (Prior) on Mar 27, 2001 at 10:38 UTC
How about creating an array with 0..Number_Of_Lines_In_File, using Algorithm::Numerical::Shuffle to randomize, then taking the first 'n' slots as the indexes to the question? (Ask one of the math geeks about perhaps creating a random list of 0..$Max_Questions, and using that to select from the shuffled list. It may improve that randomicity. I dunno.) This one particular run seems weighted to lower values, but a purely visual guess of subsequent runs appears to have moderately even distributions over the range. #!/usr/local/bin/perl -w use strict; use Algorithm::Numerical::Shuffle qw(shuffle); srand(); my @array1=(); my $Max_Questions = 30; #Will vary upon how many questions #the person wants to answer my $ref = random(\@array1,$Max_Questions); sub random{ my $array = shift; my $Max = shift; my $lines = 100; #There is a function that gets the #number of lines in the actual data #file, but there will be at least 100. @$array = (shuffle (0..$lines))[0..$Max_Questions]; my @sorted = sort {$a<=>$b} @$array; print"Element\t\tUnsorted\tSorted\n"; print"-------\t\t--------\t------\n"; for(my $z = 0;$z<$Max;$z++){ print" $z"; print"\t\t @$array[$z]\t\t"; print" $sorted[$z]\n"; } } [download] `Element Unsorted Sorted ------- -------- ------ 0 30 0 1 48 1 2 70 3 3 55 4 4 3 6 5 23 8 6 45 10 7 1 21 8 33 22 9 31 23 10 28 26 11 0 27 12 97 28 13 6 30 14 91 31 15 10 33 16 22 36 17 36 42 18 26 43 19 43 45 20 69 48 21 27 50 22 59 55 23 8 59 24 72 65 25 65 69 26 94 70 27 42 72 28 50 91 29 4 94` [download] --Chris e-mail jcwren	[reply] [d/l] [select]
(tye)Re: Creating an array of unique numbers by tye (Sage) on Mar 27, 2001 at 11:21 UTC
Create a list of all possible question numbers and start to perform a standard Fisher-Yates shuffle on it. You can then stop after N steps and you will have N question numbers chosen perfectly at random (and the unchosen numbers sorted after that). - tye (but my friends call me "Tye")	[reply]
Re: Creating an array of unique numbers by archon (Monk) on Mar 27, 2001 at 10:41 UTC
There are certainly ways to do what you are trying to do, but it is much simpler to put all the questions into an array, then randomize that array, and pop or shift X number of questions off of it. As far as generating X unique numbers, try this: `my $max = 10; my $num = 5; my %numhash; for (1 .. $num) { my $randnum = int(rand($max) + 1); redo if exists ($numhash{$randnum}); $numhash{$randnum} = 1; } for (keys (%numhash)) { print "$_\n"; }` [download]	[reply] [d/l]
(jcwren) Re: Creating an array of unique numbers (Benchmarks!) by jcwren (Prior) on Mar 28, 2001 at 00:28 UTC
I decided to Benchmark the original routine I presented, along with lucs code, and a modification to it. Using the shuffle method is clearly the least efficient, under every test case. lucs method is really slick, and I was able to boost the speed with a little enhancing. I wasn't up to trying to code up tye's method. Below are the benchmarks for the various methods. The speed difference is consistently 2.9 times from fastest to slowest. I don't know how the shuffle() routine works, but it doesn't seem to be affected by the size of the data set. Since I do know how lucs method works, I didn't expect it to be non-linear (and it's not). An interesting note in the 'test_jcwren' vs. 'test_jcwren2' tests. The only difference is assigning 'undef' instead of 0 to the hash element. Since exists() knows that 'undef' is a legal value for a hash value, we can shave a significant amount of time off of 100,000 interations. In fact, simply assigning 'undef' unstead of 0 account for 8 CPU seconds. I guess because the interpeter doesn't have to create a scalar and do the conversion, it's a little faster. I've also left some code in their if you want to check the output and distributions of the random numbers. I hacked tilly's histogram code a previous node. All 4 implementations appear to have the same distribution, and produce values across the requires span. 100 Lines, 30 Questions Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu +csjcwren, test_lucsjcwren2... test_jcwren: 73 wallclock secs (43.26 usr + 0.08 sys = 43.34 CPU +) test_lucs: 30 wallclock secs (17.27 usr + 0.01 sys = 17.28 CPU +) test_lucsjcwren: 28 wallclock secs (16.32 usr + 0.00 sys = 16.32 CPU +) test_lucsjcwren2: 26 wallclock secs (15.61 usr + 0.01 sys = 15.62 CPU +) 500 Lines, 150 Questions Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu +csjcwren, test_lucsjcwren2... test_jcwren: 379 wallclock secs (217.85 usr + 0.15 sys = 218.00 +CPU) test_lucs: 145 wallclock secs ( 82.29 usr + 0.19 sys = 82.48 +CPU) test_lucsjcwren: 137 wallclock secs ( 79.63 usr + 0.14 sys = 79.77 +CPU) test_lucsjcwren2: 128 wallclock secs ( 75.22 usr + 0.02 sys = 75.24 +CPU) 1000 Lines, 300 Questions Benchmark: timing 100000 iterations of test_jcwren, test_lucs, test_lu +csjcwren, test_lucsjcwren2... test_jcwren: 754 wallclock secs (438.51 usr + 0.50 sys = 439.01 +CPU) test_lucs: 291 wallclock secs (168.96 usr + 0.12 sys = 169.08 +CPU) test_lucsjcwren: 277 wallclock secs (162.44 usr + 0.04 sys = 162.48 +CPU) test_lucsjcwren2: 267 wallclock secs (154.43 usr + 0.11 sys = 154.54 +CPU) [download] #!/usr/local/bin/perl -w use strict; use Algorithm::Numerical::Shuffle qw(shuffle); use List::Util qw(min max sum); use POSIX qw(ceil floor); use Benchmark; my $Max_Questions = 0; my $Max_Lines = 0; { my $results; my %hash; for (([100,30]), ([500,150]), ([1000,300])) { $Max_Lines = @$_[0]; $Max_Questions = @$_[1]; printf ("%d Lines, %d Questions\n", $Max_Lines, $Max_Questions); timethese (100000, { 'test_jcwren' => sub { test_jcwren () + }, 'test_lucs' => sub { test_lucs () + }, 'test_lucsjcwren' => sub { test_lucsjcwren +() }, 'test_lucsjcwren2' => sub { test_lucsjcwren2 + () }, } ); print "\n"; } =cut # # Be sure to set $Max_Lines and $Max_Questions at the top # for (0..1000) { $results = test_lucsjcwren (); $hash {$_}++ foreach (@$results); } show_array ($results); show_histogram (1, \%hash); =cut } # # Test cases # sub test_jcwren { return \@{[(shuffle (0..$Max_Lines-1)) [0..$Max_Questions-1]]}; } sub test_lucs { my $m = $Max_Questions; my $j = $Max_Lines - $m + 1; my %sample; while ($m-- > 0) { my $val = int ($j * rand (1)); $sample{ exists $sample{$val} ? $j - 1 : $val } = 0; ++$j; } return (\@{[keys %sample]}); } sub test_lucsjcwren { my $m = $Max_Questions; my $j = $Max_Lines - $m + 1; my %sample; while ($m--) { my $val = int (rand ($j++)); $sample {exists $sample{$val} ? $j - 2 : $val} = 0; } return (\@{[keys %sample]}); } sub test_lucsjcwren2 { my $m = $Max_Questions; my $j = $Max_Lines - $m + 1; my %sample; while ($m--) { my $val = int (rand ($j++)); $sample {exists $sample{$val} ? $j - 2 : $val} = undef; } return (\@{[keys %sample]}); } # # Utilities # sub show_array { my $array = shift; my @sorted = sort {$a <=> $b} @$array; print"Element\t\tUnsorted\tSorted\n"; print"-------\t\t--------\t------\n"; for (my $z = 0; $z < $Max_Questions; $z++) { print" $z"; print"\t\t @$array[$z]\t\t"; print" $sorted[$z]\n"; } } sub show_histogram { my $bin_size = shift; my $articles = shift; my $width = 50; my $max_count = max (values %$articles); my $scale = ceil($max_count / $width); print " Index Count\n"; print "------------- -------", "-" x 50, "\n"; my @bins = sort {$a <=> $b} keys %$articles; foreach my $bin (min(@bins)..max(@bins)) { my $count = $articles->{$bin} \|\| 0; my $extra = ($count % $scale) ? '.' : ''; my $start = $bin * $bin_size; my $end = $start + $bin_size - 1; printf "%4d .. %4d \[%4d\] %s$extra\n", $start, $end, $count, '#' x floor ($count / $scale); } print "\n Scale: #=$scale\n\n" if $scale > 1; } [download] --Chris e-mail jcwren	[reply] [d/l] [select]
Re: (jcwren) Re: Creating an array of unique numbers (Benchmarks!) by archon (Monk) on Mar 28, 2001 at 02:14 UTC
You made me curious, so I tested the algorithm I proposed and came up with these results: 100 Lines, 30 Questions Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 6 wallclock secs ( 5.40 usr + 0.00 sys = 5.40 CPU) test_jcwren: 19 wallclock secs (17.80 usr + 0.02 sys = 17.82 CPU) test_lucs: 7 wallclock secs ( 6.30 usr + 0.00 sys = 6.30 CPU) test_lucsjcwren: 6 wallclock secs ( 5.84 usr + 0.00 sys = 5.84 CPU) test_lucsjcwren2: 6 wallclock secs ( 5.55 usr + 0.01 sys = 5.56 CPU +) 500 Lines, 150 Questions Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 28 wallclock secs (26.94 usr + 0.01 sys = 26.95 CPU) test_jcwren: 93 wallclock secs (88.83 usr + 0.11 sys = 88.94 CPU) test_lucs: 32 wallclock secs (30.66 usr + 0.00 sys = 30.66 CPU) test_lucsjcwren: 30 wallclock secs (28.46 usr + 0.02 sys = 28.48 CPU) test_lucsjcwren2: 29 wallclock secs (26.97 usr + 0.03 sys = 27.00 CPU +) 1000 Lines, 300 Questions Benchmark: timing 5000 iterations of test_archon, test_jcwren, test_lu +cs, test_lucsjcwren, test_lucsjcwren2... test_archon: 58 wallclock secs (54.92 usr + 0.05 sys = 54.97 CPU) test_jcwren: 189 wallclock secs (179.66 usr + 0.13 sys = 179.79 CPU) test_lucs: 66 wallclock secs (63.03 usr + 0.05 sys = 63.09 CPU) test_lucsjcwren: 61 wallclock secs (58.53 usr + 0.05 sys = 58.58 CPU) test_lucsjcwren2: 57 wallclock secs (54.80 usr + 0.05 sys = 54.85 CPU +) [download] test_archon: `sub test_archon { my %numhash; for (1 .. $Max_Questions) { my $randnum = int(rand($Max_Lines) + 1); redo if exists ($numhash{$randnum}); $numhash{$randnum} = 1; } return [keys (%numhash)]; }` [download] Edit 2001-03-26 by tye to remove `<pre>`	[reply] [d/l] [select]
(tye)Re2: Creating an array of unique numbers (Benchmarks!) by tye (Sage) on Mar 28, 2001 at 02:38 UTC
You need to try your algorythm with a large list when you want all (or even nearly all) of the question numbers! (: - tye (but my friends call me "Tye")	[reply]
Re: (tye)Re2: Creating an array of unique numbers (Benchmarks!) by archon (Monk) on Mar 28, 2001 at 03:38 UTC
Re (tilly) 5: Creating an array of unique numbers (Benchmarks!) by tilly (Archbishop) on Mar 28, 2001 at 09:30 UTC
Re: Creating an array of unique numbers by lucs (Sexton) on Mar 27, 2001 at 20:54 UTC
# The solutions presented so far by the other monks all return a # subset of a shuffle of the whole (except tye's) array of possible # numbers. Compare the following beautiful algorithm, which calls # rand() only as many times as there are numbers to pick. It is # attributed to Bob Floyd by Jon Bentley, in his "More Programming # Pearls" (Addison-Wesley, 1988). Here is Bentley's explanation, # which I've adapted for the displayed Perl code: # # "We can appreciate the correctness of [the algorithm] # anecdotally. When $m is 5 and $n is 10, the algorithm first [...] # computes in %sample a 4-element sample in the range 0..8. Next it # assigns to $val a random integer in the range 0..9. Of the 10 # values that $val can assume, exactly 5 result in inserting 9 into # %sample: the four values already in %sample, and the value 9 # itself. Thus element 9 is inserted into the set with the correct # probability of 5/10." use strict; sub sample { # Returns a list of $m different random integers # between 0 and $n - 1. my ($m, $n) = @_; my %sample; my $j = $n - $m + 1; while ($m-- > 0) { my $val = int($j * rand(1)); $sample{ exists $sample{$val} ? $j - 1 : $val } = 0; ++$j; } keys %sample; } [download]	[reply] [d/l]
Re: Creating an array of unique numbers by a (Friar) on Mar 27, 2001 at 10:53 UTC
Well, not sure where your code goes awry (though it doesn't seem like checking the last 2, er 3 (length -2) array entries would keep the array unique - 41 is appearing as element 4 and 20) but how about a hash? Something like: `while ( scalar keys %qs < $Max - 1 ) { $qs{int(rand($lines))}++; }` [download] a	[reply] [d/l]
Re: Creating an array of unique numbers by TStanley (Canon) on Mar 28, 2001 at 02:19 UTC
My thanks to all for your excellent solutions! I have just about finished the program, and I will be posting it in the Code Catacombs when I have put the final touches on it. I would welcome all comments/suggestions at that time. TStanley In the end, there can be only one!	[reply]