in reply to Truely Unique Code Set?

I liked this question, as it presented some interesting problems. You're choosing unique random permutations from a set that is too large to hold in memory.

At first blush maybe one would think, "Generate all permutations of digits, shuffle, and pick." But of course that wouldn't work because there are 8.2e14 permutations of the set [ 'a'..'z', 3, 4, 6, 7, 9 ] taken ten digits at a time, which would exceed memory and time. Another approach would be to just permute as many times as needed with a permutation iterator from Algorithm::Permute. But then you lose randomness. Even if you shuffle the first six million permutations, you still lose randomness since you will never get to see any permutations that occur after the six-millionth.

So the method you were using of generating random strings and tossing out duplicates seems to be on track. However, there were some improvements and optimizations that could be made.

First, rather than generate a big list and then throw out dupes, you could generate an element, test for uniqueness, and if not unique, generate again, repeating that process six million times. Then you don't really need to keep as careful track as to how many duplicates were tossed out, and you know at all times how many uniques you've produced.

Next, by encapsulating the "generate-try-regenerate if necessary" code in its own subroutine you abstract away the complexity from your main loop of testing for uniqueness.

Another optimization is this: Each time you call your current subroutine you're re-generating the list of ( 'a' .. 'z', 3, 4, 6, 7, 9). That could be created at a broader scope so that it's not necessary to re-create it on every iteration. Alternatively you could just pass a ref to this list of acceptable characters to the sub. But I'll provide an example where we take the first approach by binding the list to a subroutine.

Now that we've gotten this far it occurs to me that we could manufacture a string generating iterator function that encapsulates all of the complexity, abstracting it away from the main loop. In so doing, we would bind by way of a closure all of the details that compose the iterator, further improving simplicity within the main loop.

To accomplish all this binding and manufacturing of a sub-ref we will call a subroutine one time that binds values, and returns the sub-ref. Then we just call that sub-ref repeatedly, and each time we get a fresh, unique, random code.

Here's the result:

#!/usr/bin/env perl #program to generate 6 Million unique codes, Oct 15, 2011. use strict; use warnings; use autodie; use v5.10; # Just for feature "say" # Some criteria. my @digits_used = ( 'a' .. 'z', 3, 4, 6, 7, 9 ); my $string_length = 10; my $need = 6_000_000; my $filename = 'testfile.txt'; # Takes a string length, and a list of digits allowed, and returns an # iterator function that on each call returns a unique string. sub create_random_string_generator { # Our bind values: my $size = shift; my @contains = @_; # Our duplicate cache. my %found; # Here's where we create an iterator function with bound criteria. return sub { my $rand_string; do{ my @rand_digits; push @rand_digits, $contains[ rand @contains ] for 1 .. $size; $rand_string = join '', @rand_digits; } while( $found{$rand_string}++ ); return $rand_string; } } # Now we bind our criteria and create a generator iterator function. my $string_gen = create_random_string_generator( $string_length, @digits_used ); open my $out_fh, '>', $filename; for( my $count = 0; $count < $need; $count++ ) { say $out_fh $string_gen->(); } close $out_fh;

This is an efficient and flexible method of generation that pulls the complexity away from the main loop.

There are two limitations: First, while the set of numbers created doesn't have to fit in memory, the set of values already used does, which is the same thing for all practical purposes. By the end of the run, there will be six million keys in the %found hash. If the number of unique codes exceeds what you can reasonably hold in memory, you will need to tie the hash to a file. ...but that leads to the second problem:

On the 2nd iteration, the chances of a collision with an already picked number is very low; 1 in 8.2e14. But by the six-millionth iteration, the chances are 1 in 1.4 billion per iteration. That's still going to yield a very low collision rate. The odds of having a collision are relatively low. But if you tie your %found hash to a file and decide to create 8.2e14 unique random codes you're in trouble, because the collision probability will approach 100% on each try as iterations increase.

It's improbable you'll ever need enough codes to have to worry about the cost of avoiding collisions though. Even if you generated 2 billion codes, the odds of generating a collision rise to 1 in about 410 thousand per iteration as you approach 2 billion; significant enough that you cannot forgo testing, but not significant enough that the collision rate will impact performance.

Here are some tests you can run from the command line to verify that it worked...

# Verify that there are no duplicates. perl -nE 'say $_ if $seen{$_}++;' testfile.txt # We want no output. # Verify that there are 6000000 lines. perl -nE '$count++; END{ say $count; }' testfile.txt # We want 6000000 as the output. # Verify that we only have digits between a..z and 3, 4, 6, 7, 9. perl -nE 'say "$. : $_" if m/[^a-z34679\n]/' testfile.txt # We want no output.

Update: If we think of the function that was manufactured as being a variation on a first class object, it becomes easy to transform the above "Higher Order" or "Functional Programming" method into an "Object Oriented" paradigm. In this case the manufactured function is a sort of light-weight object. If we were to transform it to an object using Perl's traditional OO model, here's how it would look (along the way I improved the string-generation code, and those changes themselves could be put back into the functional approach).

package CodeGenerator; use strict; use warnings; sub new { my $class = shift; my $self = bless {}, $class; $self->{string_size} = shift; $self->{digits_used} = [ @_ ]; $self->{seen} = {}; # Not necessary, but helpful to # those reading the code. return $self; } sub generate { my $self = shift; my $rand_string; do{ $rand_string = ''; $rand_string .= $self->{digits_used}[ rand @{$self->{digits_used}} ] for 1 .. $self->{string_size}; } while( $self->{seen}{$rand_string}++ ); return $rand_string; } 1; package main; use strict; use warnings; use autodie; use v5.10; # For "say". # Some criteria. my @digits_used = ( 'a' .. 'z', 3, 4, 6, 7, 9 ); my $string_length = 10; my $need = 6_000_000; my $filename = 'testfile.txt'; my $codegen = CodeGenerator->new( $string_length, @digits_used ); open my $out_fh, '>', $filename; for( my $count = 0; $count < $need; $count++ ) { say $out_fh $codegen->generate(); } close $out_fh;

Whichever of these approaches are used, there's no need to verify uniqueness beyond what is happening internally (this isn't an excuse to forgo unit testing, but saying explicit verification isn't necessary outside of a test environment). It would be impossible for non-unique codes to make it into the output file. However, the one-liners I provided will provide the verification if needed.

As for which technique to use.... well, first, they're both the same technique, expressed using different paradigms. So which paradigm... Frankly I find the OO paradigm too verbose for such a little problem, although the functional solution may be harder to read. The functional could be made to work without binding any values other than the cache, and in so doing, you wouldn't have to manufacture a function. In other words, it could be turned into a simple Procedural approach. But then it's less interesting. ;)

Update2: Thanks to Perlbotics for pointing out that my original estimates of collision probability were off. Hopefully it's corrected now.


Dave

Replies are listed 'Best First'.
Re^2: Truely Unique Code Set?
by richCarthew (Initiate) on Oct 15, 2011 at 22:04 UTC
    Excellent! Thanks Dave, this is an excellent exposition of the key characters in this little program, I think I follow roughly 90% of the optimization ideas that you've presented. I'm going to study your improved program (in addition to Perlbotics work) to better my programming, such as it is. My forte is in fine art, and sculpture; I've only recently evolved into coding as it is a wonderful artform in itself when allied with the interactive possibilities it can offer in sculptural works. I am currently working on a piece where I hand draw the trace lines of a microprocessor to resemble a face, where the ATMEGA 328 chip is in the 'nose'. Then I etch the transparent PCB to form the circuit, carve the rear and illuminate it via C to control LEDs integral in the work. But I digress. Thank you so much for taking the time to work on this. Your insights will extend into my thinking when coding going forward. Rich.
    I dove into the C, found a Shell and inside was a PERL.
      Hello again Dave, I'm not sure if you'll be notified of a new reply to this thread, however, I'll ask away in the hope that it will be seen.
      I've generated the codes successfully and am now wishing to import them (as a CSV text file) into MySQL to form a table. I'm stumped as to where I could make an adjustment near the 'say' method (within your kindly provided program) so that a comma is inserted directly after each random value, so that the resultant file could then be imported into MySQL via MyPhpAdmin GUI.
      Thanks!
      Rich
      I dove into the C, found a Shell and inside was a PERL.

        If you have access to the MySQL database from the machine you are running the script on you can instead insert the data directly into the database using DBI and DBD::mysql.

        True laziness is hard work