lihao has asked for the wisdom of the Perl Monks concerning the following question:

Hi, folks:

it might not be a Perl-specific question, but I'd rather find a Perl solution. :-) I have over 25,000K unique strings (5-50 chars in length) and I need to generate a 12-digit unique/random hexadecimal ID for each record and import them into MySQL. New records having a different field value might be input later. So far, I can use MySQL's auto_increment column to make an absolutely unique seed and pad it to 12 digits, but that doesn't look flexible for future input. Is there any method (hashing, MD5, or whatever) that I can use with Perl to generate 12-digit unique/random keys for over 25,000K records?

many thanks

lihao


Re: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by ikegami (Patriarch) on Apr 30, 2008 at 21:15 UTC
    If your goal is to create unguessable unique keys and you don't care about leaking info about the order in which the keys are created, you could concatenate a unique incrementing number with a random number.
    my $unique  = time;
    my $key_bin = pack('N', $unique) . pack('C*', map { int(rand()*256) } 1..6);
    my $key_hex = unpack('H*', $key_bin);
    That gives something like
    4818e14c1662fd0dac98
    \______/\__________/
        |        |
     32-bit    48-bit
     unique    random

    You can reduce the size of the unique portion (by using substr(pack('N', $unique), 1)) and change the size of the random portion (by changing the "6") at will.
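As a sketch of those two adjustments combined (variable names are illustrative, not from the post): dropping the high byte of the packed time leaves a 24-bit unique part, and nine random bytes give a 72-bit random part, for a 12-byte (24-hex-char) key.

```perl
use strict;
use warnings;

# Variation on the scheme above: a 24-bit unique part plus a
# 72-bit random part, yielding a 12-byte / 24-hex-char key.
my $unique  = time;
my $key_bin = substr( pack( 'N', $unique ), 1 )           # 24-bit unique
            . pack( 'C*', map { int rand 256 } 1 .. 9 );  # 72-bit random
my $key_hex = unpack 'H*', $key_bin;
print "$key_hex\n";
```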

Re: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by BrowserUk (Patriarch) on May 01, 2008 at 01:01 UTC
    12-digit unique/random keys against over 25,000K records.

    Using rand and digests is wrong for this.

    • To get unique random, you have to remember what you've had.

      Sounds easy, but as the list of numbers you've already used grows, the process of finding one you haven't takes longer and longer.

    • Any truncation applied to a digest rapidly and vastly increases your chances of duplicates.

      Whatever digest you use, duplicates aren't impossible, just very rare. But whatever the odds of finding a duplicate are, they (at least) double with every bit you throw away.

      I.e. if you start with a 128-bit (say MD5) digest, convert it to a hex representation (32 chars), and then truncate that to 12 chars, that's 12/32 * 128 == 48 bits.

      And 128 - 48 = 80, so your new 12-character identifiers are (at least) 2**80, or 1,208,925,819,614,629,174,706,176, times more likely to encounter duplicates.

      However low the odds are to begin with, that's a huge increase in the risk.
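To see the truncation effect concretely, here is a small demo (my own illustration, not from the post) that keeps only 4 hex chars (16 bits) of each MD5 and counts how many sequential inputs it takes to hit the first duplicate; the birthday bound predicts a collision after roughly sqrt(2**16) ~= 256 inputs.

```perl
use strict;
use warnings;
use Digest::MD5 qw( md5_hex );

# Truncate each MD5 digest to 4 hex chars (16 bits) and hash
# sequential inputs until two truncated digests collide.
my %seen;
my $n = 0;
while ( 1 ) {
    $n++;
    my $short = substr md5_hex("record-$n"), 0, 4;
    last if $seen{$short}++;    # stop at the first duplicate
}
print "first duplicate after $n inputs\n";
```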

    By far the simplest way for any reasonably low number of values (25k is very low) is to simply generate the N unique numbers, shuffle them, and take the next one off the top of the list as you assign it.

    Guaranteed unique. Completely random association between the number and what it is assigned to. And trivial to code.

    use List::Util qw[ shuffle ];
    my @randUnique = shuffle map{ sprintf "1%011d", $_ } 1 .. 25e3;;
    print $randUnique[ $_ ] for 1 .. 10;;   ## First 10
    100000016682
    100000002653
    100000013669
    100000004625
    100000009482
    100000002763
    100000022284
    100000000048
    100000015278
    100000012155

    The argument can be made that whilst the association between any given number and what it represents is unguessable, guessing a valid number, regardless of what it represents, is trivial. And if that is the major concern, then a small adjustment is needed.

    my @randUnique = shuffle map{ sprintf "%07d%05d", 1e7+int( rand 1e6 ), $_ } 1 .. 25e3;;
    print $randUnique[ $_ ] for 1 .. 10;;
    1091497817847
    1018472205890
    1028707802676
    1078720009752
    1016540524132
    1074148507607
    1022293016846
    1018341020021
    1038845808717
    1056634512933

    Now, guessing a valid number has a very low probability (25e3/1e12 ~= 0.000000025), and even if the bad guy does succeed, there's no way for him to know what his guess represents. Just store your 25k (or a million; it takes no time to generate them):

    print time();
    my @randUnique = shuffle map{ sprintf "%07d%05d", 1e7+int( rand 1e6 ), $_ } 1 .. 1e6;
    print time();;
    1209603392
    1209603398   ## 6 seconds for 1e6

    unique random numbers somewhere, and take the next one from the top of the pile as you need it.
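Handing a key out is then just a pop from the pre-shuffled pool; a minimal sketch of that step (the @pool and next_key names are mine, not from the post):

```perl
use strict;
use warnings;
use List::Util qw( shuffle );

# Pre-generate and shuffle the pool once; uniqueness is guaranteed
# by construction, randomness of assignment by the shuffle.
my @pool = shuffle map { sprintf "1%011d", $_ } 1 .. 25_000;

# Each call consumes one key from the top of the pile.
sub next_key { return pop @pool }

my %seen;
$seen{ next_key() }++ for 1 .. 1000;
print scalar( keys %seen ), " distinct keys assigned\n";
```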


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by kyle (Abbot) on Apr 30, 2008 at 20:27 UTC

    You hint at a solution already: Digest::MD5 or Digest::SHA1.

    use Digest::SHA1 'sha1_hex';
    my $unique_string = 'blahblahblah';
    my $sha1 = sha1_hex( $unique_string );
    my $out  = substr $sha1, 0, 12;

      thanks, can this guarantee the uniqueness? Previously my boss gave me 8 digits, and when I used Digest::MD5, I got duplicated IDs. 12 digits sounds OK; at least I didn't get duplicated records by testing on 780K sub-records with Digest::SHA1. Is 12 digits enough for 50 million unique strings? many thanks

      lihao

        A 12 digit hex string can have 281_474_976_710_656 different values, so it's certainly capable of representing 50_000_000 different strings. That said, Digest::SHA1 does not guarantee that two different strings will result in different hashes. Quite the opposite. If all 50M inputs produce all-different outputs, there's a certain degree of luck involved. (I'm too lazy to compute the probability of a collision for you, but it looks pretty small.)

        If you want the output to be exactly as unique as the input, you'll have to come up with some kind of encoding. That is, it will be possible to go from the encoded version back to the original string.

        The problem is that your inputs have a significantly larger alphabet than the 16 hex digits, and they're longer than the 12 digits you want to store. (I'm guessing a little here because you haven't told us anything about the format of the inputs.) As such, I think it's safe to say there's no way to encode every possible input string into the output format you want. There just isn't enough "space" to do it in.

        That leaves two options, in my opinion.

        1. Use a hashing algorithm and rely on luck.
        2. Enumerate the inputs with a serial number until you run out of them (i.e., let the database do it).
        Are 12 hex-digits enough for 50 Million unique strings?
        Let's ask perl:

        perl -e 'print ((16**12 >= 50*10**6) ? "Yes" : "No")'

        Answer:
        Yes

        You get at most 16**12 = 281474976710656 different values with 12 hex digits, which is far more than 50000000.

        Cutting SHA1 or MD5 down to 12 digits will increase the possibility of collisions a lot. Can you not use a longer string, or use binary?

        This is a variant of the "birthday problem." There are many mathematical discussions of the problem, but the basic idea and equations can be found at http://en.wikipedia.org/wiki/Birthday_attack. For 25,000 records and the 32-bit address space that you ran into trouble with, the collision probability is around seven percent. With a 48-bit address space, the probability that you will find one collision is tap - tap - tap about one in nine hundred thousand. That might still be cutting it a little close, if your data set grows over time.

        Update: Sorry, I missed the question about fifty million records. With a 48-bit address space, that gives almost a 99% probability of at least one collision(!) The table in the article shows that 64 bits (sixteen hex digits) gives a one in a million chance of a collision for about sixty one million records.
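The figures above can be reproduced with the standard birthday approximation p ~= 1 - exp( -n(n-1) / 2N ); a sketch (the helper name is mine, and the formula is the usual approximation, not taken from the thread):

```perl
use strict;
use warnings;

# Approximate probability of at least one collision among $n random
# values drawn from a space of $bits bits (birthday approximation).
sub collision_probability {
    my ( $n, $bits ) = @_;
    return 1 - exp( -$n * ( $n - 1 ) / ( 2 * 2**$bits ) );
}

printf "25e3 keys, 32 bits: %.4f\n", collision_probability( 25_000, 32 );
printf "25e3 keys, 48 bits: %.2e\n", collision_probability( 25_000, 48 );
printf "50e6 keys, 48 bits: %.4f\n", collision_probability( 50_000_000, 48 );
```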

        previously my boss gave me 8-digits, and when I used Digest::MD5, I got duplicated IDs
        If you got duplicates in the output, you can be sure the list you used as input had duplicates too. Hashes like MD5 are designed not to produce collisions (different inputs producing the same output).
        Either that, or there's a programming error in your script using Digest::MD5.
Re: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by syphilis (Archbishop) on May 01, 2008 at 01:35 UTC
    I need to generate an 12-digit unique/random hex-decimal ID

    It's somewhat obvious, but if you could raise the base of your 12-digit number from 16 to 36 or 62 or 64 or 256 (the higher the better), then the chance of duplication is greatly diminished.
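For example, 12 base-62 characters cover 62**12 (~3.2e21) values versus 16**12 (~2.8e14) for hex. A sketch of such an encoding (the encode_base62 name is illustrative, not a standard module):

```perl
use strict;
use warnings;

# Encode a non-negative integer in base 62 ([0-9a-zA-Z]),
# left-padded with the zero digit to a fixed width.
my @digits = ( 0 .. 9, 'a' .. 'z', 'A' .. 'Z' );

sub encode_base62 {
    my ( $n, $width ) = @_;
    my $out = '';
    do {
        $out = $digits[ $n % 62 ] . $out;
        $n   = int( $n / 62 );
    } while ( $n > 0 );
    return $digits[0] x ( $width - length($out) ) . $out;
}

print encode_base62( 25_000_000, 12 ), "\n";
```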

    Cheers,
    Rob
Re: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by mobiusinversion (Beadle) on May 01, 2008 at 01:40 UTC
    Here is one way to do it:
    use Time::HiRes qw(time);
    my $id_number = time;
    $id_number =~ s/\.//;
    $id_number =~ s/(.{12}).*/$1/;
    Simply assign an id number to a record every 1/100th of a second and you should be fine with this strategy for about 150 years, with unlimited flexibility.

    Now the "actual" answer to your "actual" question (without high resolution timer hacking) depends on two things:

    1) the number of characters that you allow into your records. For example, you could set up your database to only allow the 97 ASCII characters that appear on generic western keyboards.

    2) the maximum information entropy that you allow into the record set via your data model (for example, would you allow some stranger to introduce pseudo-random text into the rows? probably not). This is also known as the Kolmogorov complexity.

    Of course, you can generate about 2.8 x 10^14 12-digit hex ids, which is far greater than the number of records you currently have and is still far greater than 50,000,000.

    However, you stated your real question in passing "But it looks not flexible for new input in the future.".

    Consider this code, which for an n-digit id on m characters (like 12 and 16), will tell you the upper and lower bounds of strings of some length (like 5 and 50) on k letters (like 97) that can be guaranteed unique ids given a uniform distribution of input data (maximal entropy).
    my $id_length            = 12;
    my $id_char_set          = 16;
    my $min_data_length      = 5;
    my $max_data_length      = 50;
    my $tiny_alphabet_size   = 26;     # a-z
    my $small_alphabet       = 37;     # a-z, 0-9 and \s
    my $medium_alphabet_size = 97;     # generic western keyboards
    my $big_alphabet_size    = 65536;  # utf8
    my $world_alphabet_size  = 100713; # unicode standard 5.1

    my $counter = 0;
    for my $i ( $tiny_alphabet_size .. $world_alphabet_size ) {
        for my $j ( $min_data_length .. $max_data_length ) {
            for my $k ( $j .. $max_data_length ) {
                my $total_size = geometric_series( $i, $j, $k );
                if ( $total_size < $id_char_set ** $id_length ) {
                    printf " %-5s %-5s %-5s %-5s %-30s\n",
                        $counter++, $i, $j, $k, $total_size;
                }
            }
        }
    }

    sub geometric_series {
        my ( $base, $min, $max ) = @_;
        my $x;
        for my $i ( $min .. $max ) {
            $x += $base ** $i;
        }
        return $x;
    }
    You will find that the answer to your "actual" question, regarding assigning unique ids to unknown future inputs via a functional form, and without collision, is a resounding no.

    If you would consider enforcing integrity on the input data, and limiting the input character set, then finding a functional form of generating 12-digit hex id numbers on-the-fly would become a more well formulated idea.
Re: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by shmem (Chancellor) on Apr 30, 2008 at 20:50 UTC
    I need to generate an 12-digit unique/random

    Randomness is... just random, so you may get random duplicates with whatever random technique. Probability may vary.

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}