Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

Re^2: Question: Generate unique/random 12-digit keys for 25,000K records, howto??

by lihao (Monk)
on Apr 30, 2008 at 20:44 UTC ( [id://683790]=note: print w/replies, xml ) Need Help??


in reply to Re: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
in thread Question: Generate unique/random 12-digit keys for 25,000K records, howto??

thanks, can this guarentee the uniqueness, previously my boss gave me 8-digits, and when I used Digest::MD5, I got duplicated IDs. 12-digit sounds OK, at least I didn't get duplicated records by testing on 780K sub-records with Digest::SHA1. Is 12-digit enough for 50 Million unique strings ? many thanks

lihao

  • Comment on Re^2: Question: Generate unique/random 12-digit keys for 25,000K records, howto??

Replies are listed 'Best First'.
Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by kyle (Abbot) on Apr 30, 2008 at 21:09 UTC

    A 12 digit hex string can have 281_474_976_710_656 different values, so it's certainly capable of representing 50_000_000 different strings. That said, Digest::SHA1 does not guarantee that two different strings will result in different hashes. Quite the opposite. If all 50M inputs are all different outputs, there's a certain degree of luck involved. (I'm too lazy to compute for you the probability of a collision, but it looks pretty small.)

    If you want the output to be exactly as unique as the input, you'll have to come up with some kind of encoding. That is, it will be possible to go from the encoded version back to the original string.

    The problem is that your inputs have a significantly larger alphabet than the 16 hex digits, and they're longer than the 12 digits you want to store. (I'm guessing a little here because you haven't told us anything about the format of the inputs.) As such, I think it's safe to say there's no way to encode every possible input string into the output format you want. There just isn't enough "space" to do it in.

    That leaves two options, in my opinion.

    1. Use a hashing algorithm and rely on luck.
    2. Enumerate the inputs with a serial number until you run out of them (i.e., let the database do it).
Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by mscharrer (Hermit) on Apr 30, 2008 at 21:16 UTC
    Are 12 hex-digits enough for 50 Million unique strings?
    Let's ask perl:

    perl -e 'print ((16**12 >= 50*10**6) ? "Yes" : "No")'

    Answer:
    Yes

    You get max 16**12 = 281474976710656 different values with 12 hex-digits which is far more then 50000000.

    Cutting down SHA1 or MD5 to 12-digits will increase the possibility of collisions a lot. Can you not try to use a longer string or use binary?

Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by quester (Vicar) on May 01, 2008 at 07:34 UTC
    This is a variant of the "birthday problem." There are many mathematical discussions of the problem, but the basic idea and equations can be found at http://en.wikipedia.org/wiki/Birthday_attack. For 25,000 records and the 32-bit address space that you ran into trouble with, it's around seven percent. With a 48-bit address space, the probability that you will find one collision is tap - tap - tap about one in nine hundred thousand. That might still be cutting it a little close, if your data set grows over time.

    Update: Sorry, I missed the question about fifty million records. With a 48-bit address space, that gives almost a 99% probability of at least one collision(!) The table in the article shows that 64 bits (sixteen hex digits) gives a one in a million chance of a collision for about sixty one million records.

Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by carol (Beadle) on Apr 30, 2008 at 21:17 UTC
    previously my boss gave me 8-digits, and when I used Digest::MD5, I got duplicated IDs
    If you got duplicates in the output, you can be sure the list you used as input had duplicates also. Hashes like MD5 are designed to not produce collisions (different inputs producing same output).
    Either that, or some programming error in your script using Digest::MD5.

      That's actually not that far fetched.

      use Digest::MD5 'md5_hex'; my $x = 'a'; my %found; my $key; while (1) { $key = substr md5_hex($x), 0, 8; if ( exists $found{$key} ) { my $first_md5 = md5_hex( $found{$key} ); my $second_md5 = md5_hex( $x ); die "found $key at $x ($second_md5) and $found{$key} ($first_m +d5)\n"; } else { $found{$key} = $x; } $x++; } __END__ found a986d9ee at bwma (a986d9ee140c5acbf0d51c00bc5a7810) and kot (a98 +6d9ee785f7b5fdd68bb5b86ee70e0)

      With only eight hex digits, there's only 4_294_967_296 different hashes. You can exhaust that pretty quck.

      An md5 hash of a large dataset is always a reduction of information. Different sets of information can result in the same reduction, which is actually the case with md5. There's a paper about md5 collisions.

      --shmem

      _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                    /\_¯/(q    /
      ----------------------------  \__(m.====·.(_("always off the crowd"))."·
      ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
      He was using MD5 and then only used the 8 first (or last) characters. This can lead to collisions.

      (Also in theory two MD5s from two different inputs can be identical, but this is very unlikely)

        You are right. I didn't realise he would take a substring from the result.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://683790]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (4)
As of 2024-04-25 05:21 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found