Generating 0 .. N Randomly and Efficiently

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
When solving a problem with a limited toolbox, you often have to use other tools creatively or learn new ones. Recently, I have been solving old problems by intentionally limiting my toolbox to see what I learn. Here is my latest:

Generate all numbers 0 .. N randomly using as little memory as possible as fast as possible

The general problem here is that the more numbers you choose, the larger the number of possible misses becomes.
My ideas for a solution were numerous variations on:

Create a bitstring of zeros ( size = N / 8 )
Establish initial endpoints of the bitstring
Randomly select a bit within the endpoints
If bit has already been chosen, choose the nearest available bit in a random direction
If bit is either endpoint, move the endpoint to the nearest unchosen bit

I did learn a bit about bitstrings and bit operators, but not enough to be succesful at my task. The following code is my latest variation which is the most memory efficient, but not the fastest.

#!/usr/bin/perl
use strict;
use warnings;

my $next = Gen_Iter( $ARGV[0] || 10 );

while ( defined( my $rand = $next->() ) ) {
    print "$rand\n";
}

sub Gen_Iter {
    my $wrk = shift;

    # Create initial bitstring
    my $bytes = int ( $wrk / 8 );
    my $pad = 8 - $wrk % 8;
    $pad += $pad == 8 ? -8 : 0;
    $bytes++ if $pad;
    my $str = "\0" x $bytes;
    vec($str, $_, 1) = 1 for $wrk .. $bytes * 8 - 1;

    # Establish initial boundaries
    my $beg = 0;
    my $end = $wrk;


    # return magic sub
    return sub {
        return () if ! $wrk--;
        my $num = Get_Random->($str, $beg, $end);
        $beg = Post_Zero( $str, $num ) if $num == $beg;
        $end = Prev_Zero( $str, $num ) if $num == $end;
        vec($str, $num, 1) = 1;
        return $num;
    };
}

# Count leading 1s in a bitstring
sub Leading {
    my $str = shift;
    $str =~ /^(\xFF*)/;
    my $skip = length $1;
    return -1 if $skip == length $str;
    my $byte = unpack("x${skip}A1", $str);
    my $extra = -1;
    while ( $extra++ < 6 ) {
        last if ! vec($byte, $extra, 1);
    }
    return $skip * 8 + $extra;
}

# Count trailing 1s in a bitstring
sub Trailing {
    my $str = shift;
    my $len = length $str;
    $str =~ /(\xFF*)$/;
    my $match = length $1;
    return -1 if $len == $match;
    my $skip = $len - $match - 1;
    my $byte = unpack("x${skip}A1", $str);
    my $extra = 8;
    while ( $extra-- ) {
        last if ! vec($byte, $extra, 1);
    }
    return ($len - $skip - 1) * 8 + (7 - $extra);
}

# Find a 0 before current location (-1 if none)
sub Prev_Zero {
    my ($str, $orig) = @_;
    my $cur = $orig;
    my $pos = $cur % 8;
    if ( $pos ) {
        while ( $pos-- ) {
            return $cur if ! vec($str, --$cur, 1);
        }
    }
    my $length = int($orig / 8);
    my $prev = substr($str, 0, $length);
    return -1 if ! $prev;
    my $gap = Trailing( $prev );
    return -1 if $gap == -1;
    return $orig - (1 + $gap + $orig % 8);
}

# Find a 0 after current location (-1 if none)
sub Post_Zero {
    my ($str, $orig) = @_;
    my $cur = $orig;
    my $pos = $cur % 8;
    if ( $pos < 7 ) {
        for ( $pos + 1 .. 7 ) {
            return $cur if ! vec($str, ++$cur, 1);
        }
    }
    my $byte = int($orig / 8);
    return -1 if $byte == length( $str ) - 1;
    my $tail = substr($str, $byte + 1);
    my $gap = Leading( $tail );
    return -1 if $gap == -1;
    return $orig + $gap + ( 8 - $orig % 8 );
};

# Generate a random number within a specified range
sub Gen_Random { return $_[0] + int( rand( $_[1] - $_[0] ) ); }

# Select an available number randomly accounting for ghost
sub Get_Random {
    my ($str, $beg, $end) = @_;
        my $guess = Gen_Random($beg, $end);
        my $dir = int rand 2;
        while ( $guess != -1 && vec($str, $guess, 1) ) {
            $guess = $dir ? Post_Zero( $str, $guess ) : Prev_Zero( $st
+r, $guess ) ;
        }
        $guess = $dir ? $beg : $end if $guess == -1;
        return $guess;
}
[download]

The idea was to keep shrinking the search space and to not keep guessing at random on a unsuccessful match.

It also has a bug or two that I didn't bother tracking down as I have given up. I was hoping I might learn a few things from how others solved the same problem using the same restrictions. It will also be interesting to see if people have the same general idea (bitstring), but are capable of making it work.

Cheers - L~R

Update:
This post is not a request for a code critique, but a solicitation for your ideas on how to solve the same problem with the same constraints.

While I do appreciate critiques, I have already admitted this is a failure, but not for some of the faulty assumptions others have made. Please take the time to understand the code if you are going to provide a critique.

I fully acknowledge that this approach will not produce a statistically uniform distribution. While not a primary goal, it would have been nice and is one of the reasons I said it was a failure and asked how else someone might try to solve this problem.

Comment on Generating 0 .. N Randomly and Efficiently Download Code

Replies are listed 'Best First'.
Re: Generating 0 .. N Randomly and Efficiently by bart (Canon) on Oct 19, 2004 at 15:57 UTC
I'd just generate an array containing the numbers 0 .. N, and shuffle it. `use List::Util 'shuffle'; my @random = shuffle 0 .. $N;` [download] The problem with your approach is that not only the chances of misses rises as you have selected more items — it is not even garanteed to ever finish; but also, as you try to take the nearest item to your latest miss, I'm quite sure its randomness will not be that uniform, as items close to clusters of items that have been selected already, have a higher chance of being picked next.	[reply] [d/l]
Re^2: Generating 0 .. N Randomly and Efficiently by Limbic~Region (Chancellor) on Oct 19, 2004 at 16:31 UTC
bart, I'd just generate an array containing the numbers 0 .. N, and shuffle it. As would I if I wasn't trying to learn new things by intentionally trying to avoid simple/easy solutions. The problem with your approach is that not only the chances of misses rises as you have selected more items — it is not even garanteed to ever* finish* In my approach, I minimized the hit for making a miss by just selecting the nearest unchosen bit. The direction to look for that bit is random. Additionally, if there are none available, it randomly selects an endpoint which is guaranteed to be unchosen. It is guaranteed to finish. Cheers - L~R	[reply]
Re^3: Generating 0 .. N Randomly and Efficiently by bart (Canon) on Oct 19, 2004 at 18:40 UTC
I wonder... if your aim is to learn something, then why is your approach so much like most newbies' approaches to shuffling arrays? I mean, picking an item, and if it's already been picked, try again... Just see Google Groups for examples on the newsgroups. That just seems like a step backward to me... Anyway, though you do take care to make sure every loop finds at least one item, you will not get a nice even distribution. Let me give an example. Suppose that your $N is 20, and by some weird coincidence, the numbers 1 .. 10 have already been picked, and nothing else. For your next pick, the numbers 12 to 20 all have 1 chance in 21 of being picked. So do the numbers 0 and 11. But if the drawn number is between 1 and 10 (about 10 chances in 21, that is almost 50%), your system will promote it to either 0 or 11... Net result: there's 12 chances in 21, that is over 50%, that the next number picked, will be either 0 or 11. So yes, this way, you'll tend to get clustered results, instead of a nicely spread, speckled, outcome.	[reply]
Re^4: Generating 0 .. N Randomly and Efficiently by Limbic~Region (Chancellor) on Oct 19, 2004 at 18:56 UTC
Re^2: Generating 0 .. N Randomly and Efficiently by Anonymous Monk on Oct 20, 2004 at 10:01 UTC
The problem with your approach is that not only the chances of misses rises as you have selected more items — it is not even garanteed to ever finish; Unless there's a bug in the implementation, the approach will finish. In fact, it's bounded by `O(N²)`. Each iteration will find a new number - if the initial guess hit an already picked number, a scan for an unpicked number is made. And until all numbers have been picked, a scan will find an unpicked number. Now, the approach won't be fair (due to clustering), and it won't be fast (quadratic vs. linear), and the implementation might contain bugs (and is huge for such a simple problem), the algorithm will finish (the implementation might not).	[reply]
Re: Generating 0 .. N Randomly and Efficiently by blokhead (Monsignor) on Oct 19, 2004 at 17:31 UTC
With your algorithm, you won't get a truly random permutation of 0..N. Expanding on what ysth has said already, suppose N=9. If my random number generator happened to pick 0,4,1,2,3 as the initial segment of the permutation, then with probability 1/2 the next number in the permutation will be either 5 or 9, while the other 1/2 of the probability is shared among 6, 7, and 8. They are not all equally likely! In a truly random permutation, all yet-unselected numbers will be equally likely as being the next one to appear. To get around this, don't search left and right if you hit a bit that's already been chosen. Just keep choosing and choosing randomly until you hit an unselected bit. With nonzero probability you will terminate ;) This is the only way to choose a permutation truly at random (other than filtering out (or bookkeeping) the unselected bits and randomly picking from them). Another approach which still builds the entire list, but you may still appreciate because it needs only one call to rand is this: If you look at the swaps made in the Fisher-Yates shuffle, you can build a 1-to-1 mapping from the swaps you performed to the integers between 1 and N!. This is a ranking function, and it induces an ordering over the permutations. An unranking function just does the inverse, converts a number between 1 and N! into the swaps needed to build the permutation: `sub unrank { my ($n, $rank) = @_; my @perm = (0 .. $n); my @swap; for (reverse 2 .. $n) { $swap[$_] = $rank % $_; $rank = int($rank / $_); } for (2 .. $n) { @perm[ $swap[$_], $_ ] = @perm[ $_, $swap[$_] ] if $swap[$_]; } return @perm[1 .. $n]; }` [download] In other words, this converts the uniform random distribution over the integers from 1 to N! into the uniform random distribution over permutations of N. Ranking/unranking functions is a standard way to select combinatorial structures uniformly at random using just one call to rand (compare to F-Y shuffle, which needs N calls to rand to return a random permutation). Usually you rank/unrank to the standard lexicographic ordering, but in the case of permutations, the ordering induced by the F-Y shuffle actually lets you rank and unrank faster. You could try to unravel this unranking function so that it doesn't build the list in memory. In other words, given the rank r of a permutation in some ordering, and an index i, what is the i-th position of the r-th permutation? I'm not sure it's possible with this ordering, but perhaps in a lexicographic ranking/unranking scheme it would be. For the record, I got a lot of this information about ranking and unranking permutations from the online book Combinatorial generation, which has lots of interesting algorithmic material relating to permutations, combinations, subsets, trees, partitions, etc. Specifically look at the section of chapter 4 relating to permutations (and exercise 23). Update: I wrote this, ran off to class, and in the middle of lecture realized my unrank function was a bit off.. It's fixed now ;) blokhead	[reply] [d/l]
Re^2: Generating 0 .. N Randomly and Efficiently by Limbic~Region (Chancellor) on Oct 19, 2004 at 17:47 UTC
blokhead, If my random number generator happened to pick 0,4,1,2,3 as the initial segment of the permutation, then with probability 1/2 the next number in the permutation will be either 5 or 9, while the other 1/2 of the probability is shared among 6, 7, and 8. That is not actually the case. Remember that every time an endpoint is chosen, the endpoints move. The number being selected is only within the endpoints. In your example, all unchosen numbers would have the same probability of being selected. To get around this, don't search left and right if you hit a bit that's already been chosen. Just keep choosing and choosing randomly until you hit an unselected bit. With nonzero probability you will terminate ;) This is the only way to choose a permutation truly at random In my, admittedly twisted, mind - this is much closer to being random then you are giving it credit for. The endpoints automatically adjusts to keep from bouncing between 2 numbers The selection point itself is random The direction to choose from is random As there are larger and larger holes, the two numbers on either sides of those holes have a higher chance of being selected, but it is still random (just not statistically uniform). In theory, the holes should grow at the same rate, though I wouldn't be willing to make that bet. Thanks for the ideas and suggestions - I may check them out. Cheers - L~R	[reply]
Re^3: Generating 0 .. N Randomly and Efficiently by blokhead (Monsignor) on Oct 19, 2004 at 19:40 UTC
In your example, all unchosen numbers would have the same probability of being selected. Yes, that's my point ;) ...this is much closer to being random then you are giving it credit for... The selection point itself is random The direction to choose from is random ...still random (just not statistically uniform)... <nitpick> Then you need to qualify your use of the word "random." In the context of selecting things at "random," the word is meaningless without a probability distribution. If the distribution is not specified, then the only reasonable probability distribution to assume is the uniform distribution. It usually doesn't make sense to choose from a finite set of objects in a nonuniform way. It is for this reason that in CS we often say "choose at random" to implicitly mean "choose uniformly at random." But please, do not use an unqualified "random" to mean "complicated to predict" or "generated by a process that uses the random number generator" or even "chosen according to some probability distribution that assigns positive probability to all outcomes." Yes there is a difference, and this is where confusions start. Yes indeed, for many cases, your code would be "random enough," in that the probability of most permutations is not hugely different from 1/N!. But is this "random enough?" `sub random_perm { my $n = shift; return rand() < 0.2 ? (1 .. $n) : reverse (1 .. $n); }` [download] Well, it's random, but not uniform. It just happens to follow my personal favorite probability distribution ;) If there is a criteria for being "random enough", then you'd have to define what it means for a probability distribution to be "random enough" and then show convincingly that your code follows it! Basically my point is that choosing something "randomly enough" or "close to randomly" or whatever is a meaningless expression, and the concept behind it is shaky. </nitpick> Anyway, I realize this is a little tangential from the topic at hand and I take full responsibility for jumping all over it. But I often see this term misused/abused and wanted to have my thoughts on public record ;) What were we talking about again? Oh yes, permutations using no memory ;) blokhead	[reply] [d/l]
Re^4: Generating 0 .. N Randomly and Efficiently by Limbic~Region (Chancellor) on Oct 19, 2004 at 19:50 UTC
Re^5: Generating 0 .. N Randomly and Efficiently by tilly (Archbishop) on Oct 20, 2004 at 19:33 UTC
Re: Generating 0 .. N Randomly and Efficiently by ysth (Canon) on Oct 19, 2004 at 17:03 UTC
I have the feeling that searching forward or backward won't result in a well shuffled result; I'd think you'd tend to end up with some ranges of numbers clumping toward the end and some toward the beginning. Just a feeling, though... One alternative is to do random selection (repeating until successful) for the first X percent of N, then switching to scanning through the whole bitmap until you find the rand(number-remaining)'th remaining number. Some benchmarks would be needed to pick a good X; I'd guess 90%.	[reply]
Re^2: Generating 0 .. N Randomly and Efficiently by Limbic~Region (Chancellor) on Oct 19, 2004 at 17:20 UTC
ysth, I haven't spent a lot of time analyzing the distribution frequency, but this implementation does have the possibility of clumping as both you and bart indicate. This, however, is only one of dozens of variations I tried over the last couple of days and others were better at being more statistically uniform. I was hoping someone else would give it a try to see what they could come up with as there are far smarter monks here than myself. I know it seems silly. The answer without restrictions is obvious, assign to an array and shuffle. It just seems to be that there should be more efficient ways to reduce the search space, determine an unchosen bit randomly without a lot of work cycles, etc. It is obvious to me I need to learn more about bitstrings and bitwise operators in general. Cheers - L~R	[reply]
Re: Generating 0 .. N Randomly and Efficiently by tmoertel (Chaplain) on Oct 20, 2004 at 01:20 UTC
You may wish to consider the following variant of your problem: Generate a randomly permuted M-length subset of the integers in the range 0..N. What makes this variant especially interesting is that the trivial solutions based upon shuffling 0..N are no longer viable because of inefficiency. (Consider the case when M=5 and N=2^48.) In Jon Bentley's book More Programming Pearls: Confessions of a Coder (ISBN 0201118890), he discusses this problem at length and describes one of Knuth's solutions before arriving at the following, beautiful algorithm by the late Robert W Floyd: # The following algorithm is due to Bob Floyd and is # described in Jon Bentley's, _More Programming Pearls_ # (1988), on p. 143. # # In this Perl implementation, I use a hash as an efficient, # randomly-addressable series that supports insertion in # O(1) time. This implementation selects subsets of size $m # from the range [0,$n], without replacement. (The original # from Bentley's book used the range [1,n]). sub randomly_permuted_subset_of_range { my ($m, $n) = @_; my (@result, %series, $head); for (my $j = $n-$m+1; $j <= $n; $j++) { my $t = int rand($j+1); if (exists $series{$t}) { # insert $j into series after $t @series{$t,$j} = ($j, $series{$t}); } else { # prefix $t to series ($head, $series{$t}) = ($t, $head); } } for ($n = $head; defined $n; $n = $series{$n}) { push @result, $n; } return \@result; } [download] The algorithm runs in time O(M) and requires storage of only M cells. (My implementation actually requires 2M cells because, for easy Data Dumping, I copy the resulting sequence into an array before returning it, but this step isn't required. BTW, I use the term "cell" to represent the approximate storage required by a hash entry or an array element.) Some examples: `use Data::Dumper; $Data::Dumper::Indent = 0; print Dumper( randomly_permuted_subset_of_range(5, 10) ); print Dumper( randomly_permuted_subset_of_range(11, 10) ); print Dumper( randomly_permuted_subset_of_range(5, 248) );` [download] And the corresponding output: `$VAR1 = [2,1,7,0,5]; $VAR1 = [7,8,1,5,3,2,4,10,0,9,6]; $VAR1 = ['211572034652645','198222342410498', '42862092043890','253205057797894', '185109518737615'];` [download] Note that when M=N+1, Floyd's algorithm solves your original problem for any given N. Cheers, Tom Tom Moertel** : Blog / Talks / CPAN / LectroTest / PXSL / Coffee / Movie Rating Decoder	[reply] [d/l] [select]
Re: Generating 0 .. N Randomly and Efficiently by Random_Walk (Prior) on Oct 19, 2004 at 17:14 UTC
Here is a first shot, I get perl to allocate me a zero filled string of the correct size. calculate a chance step based on number of numbers left to find. Pick a random number then decrement this in countstep increments each time if find an unset bit in my string. once this is < 0 I look for the next unset bit (if I am not on one), set the bit and return its index. I can think of a couple of optimisations, moving the endpoints would be a help. for the early numbers I may partiton the space and dive into about the right subblock searching from there although this does risk reducing the randomness... `#!/usr/local/bin/perl use strict; my $number=1000; my $string; vec $string, $number, 1; while ($number) { my $chance_step = 1/$number; my $rand=rand; my $i=0; while ($rand > 0) { if (vec $string, $i++, 1) { next } else { $rand-=$chance_step } } while (vec $string, $i, 1){$i++} vec($string, $i, 1)=1; print "$i, "; $number-- }` [download] Cheers, R.	[reply] [d/l]
Re^2: Generating 0 .. N Randomly and Efficiently by Limbic~Region (Chancellor) on Oct 19, 2004 at 17:28 UTC
Random_Walk, This is very much like one of my first implementations. The trouble is that by using vec to determine if a bit has been selected, it will get slower and slower the closer to the end it gets. Try changing $number to 50_000 and see what happens the longer it runs. Cheers - L~R	[reply] [d/l]
Re^3: Generating 0 .. N Randomly and Efficiently by Random_Walk (Prior) on Oct 20, 2004 at 13:31 UTC
The problem is not with vec, it is the way I start at the start of the string and count off non set bits till I reach $rand. The following timings are on a dev machine being used by a few so there is plenty of noise. The time to read the last bits of a long string using vec looks constant even with strings spanning several orders of magnitude. #!/usr/local/bin/perl -w use strict; use Benchmark qw(timethese); timethese(5000000, { '1K' => vectest(1000), '10K' => vectest(10000), '100K' => vectest(100000), '1M' => vectest(1000000), '10M' => vectest(10000000), }); sub vectest { my $n=shift; my $s=""; vec $s,$n,1; my $look=vec $s,$n-10,1 } __END__ Benchmark: timing 5000000 iterations of 100K, 10K, 10M, 1K, 1M... 100K: 0 wallclock secs ( 0.59 usr + 0.00 sys = 0.59 CPU) 10K: 1 wallclock secs ( 0.42 usr + 0.00 sys = 0.42 CPU) 10M: 1 wallclock secs ( 0.49 usr + 0.00 sys = 0.49 CPU) 1K: 1 wallclock secs ( 0.89 usr + 0.00 sys = 0.89 CPU) 1M: 1 wallclock secs ( 0.51 usr + 0.00 sys = 0.51 CPU) [download] I have a serious optimisation up my sleave but and rather too busy to code it now :( Hope to have a shot when I get home this evening Cheers, R.	[reply] [d/l]
Re: Generating 0 .. N Randomly and Efficiently by Velaki (Chaplain) on Oct 19, 2004 at 16:05 UTC
This looks like shuffling a deck of cards. Wouldn't it be easier to create an array of the appropriate size, populate it, shuffle it, and then print the numbers? `#!/usr/bin/perl use strict; use warnings; my $SIZE = 10; my @a = (0..$SIZE); for (@a) { my $i = int(rand($SIZE+1)); ($a[$i],$a[$_])=($a[$_],$a[$i]) } print "$_\n" for @a;` [download] Hope this helped, -v "Perl. There is no substitute."	[reply] [d/l]
Re^2: Generating 0 .. N Randomly and Efficiently by Limbic~Region (Chancellor) on Oct 19, 2004 at 16:23 UTC
Velaki, Wouldn't it be easier... Yes of course, but that's not the point. The point is to learn new things by removing tools (memory) from the toolbox. Cheers - L~R	[reply]
Re^3: Generating 0 .. N Randomly and Efficiently by ikegami (Patriarch) on Oct 19, 2004 at 16:33 UTC
How does Velaki's solution use more memory (except maybe `for (@a)`, which could be replaced with `for ($j=0; $j<@a; $j++)`)? Update: Ah yes, you return an interator instead of building an actual list.	[reply] [d/l] [select]
Re^4: Generating 0 .. N Randomly and Efficiently by Limbic~Region (Chancellor) on Oct 19, 2004 at 16:37 UTC
Re: Generating 0 .. N Randomly and Efficiently by Random_Walk (Prior) on Oct 21, 2004 at 16:28 UTC
This was harder than I expected, I have been plagued by off by one errors and infinite loops but here is a version with a couple of optimations Moving start point up when start bits full moving endpoint down when it makes sense making a guess where to start the search for each number It starts out nice and fast but bogs down for big sets looking for those last few unused values and I must admit it is pig ugly code. I think overall Limbic~Region's way of doing it is preferable. #!/usr/local/bin/perl use strict; use integer; $!++; my $number=10000; my $num_per_block=10; my $string; my $start=0; my $end=$number-1; vec $string, $end, 1; while ($number) { my $block_size=(($end-$start+1)/$number)$num_per_block; my $rand=int rand($number+1); my $found=1; # optimist my $guess_block=int($rand / $block_size); my $i=$start+($guess_block$block_size); my $fiddled_rand=$rand-($guess_block*$num_per_block); while ($fiddled_rand > -1) { if (vec $string, $i, 1) { if ($i==$end+1) { # we ran off the end without getting a good number $found=0; last; } }else{ $fiddled_rand--; } $i++; } unless ($found) { $i=$start; while ($rand) { unless (vec $string, $i++, 1) { $rand--; } } } print "$i+"; vec($string, ($i-1), 1)=1; while (vec $string, $start, 1){$start++} while (vec $string, $end, 1){$end--} $number--; } [download] I have another cunning plan to prevent that horrible slowdown near the end but it is at the cost of more memory (step 1 is thrown this abomination away ;) Cheers, R.	[reply] [d/l]