Re^2: Question: Generate unique/random 12-digit keys for 25,000K records, howto??

Replies are listed 'Best First'.
Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto?? by kyle (Abbot) on Apr 30, 2008 at 21:09 UTC
A 12 digit hex string can have `281_474_976_710_656` different values, so it's certainly capable of representing `50_000_000` different strings. That said, Digest::SHA1 does not guarantee that two different strings will result in different hashes. Quite the opposite. If all 50M inputs are all different outputs, there's a certain degree of luck involved. (I'm too lazy to compute for you the probability of a collision, but it looks pretty small.) If you want the output to be exactly as unique as the input, you'll have to come up with some kind of encoding. That is, it will be possible to go from the encoded version back to the original string. The problem is that your inputs have a significantly larger alphabet than the 16 hex digits, and they're longer than the 12 digits you want to store. (I'm guessing a little here because you haven't told us anything about the format of the inputs.) As such, I think it's safe to say there's no way to encode every possible input string into the output format you want. There just isn't enough "space" to do it in. That leaves two options, in my opinion. Use a hashing algorithm and rely on luck. Enumerate the inputs with a serial number until you run out of them (i.e., let the database do it).	[reply] [d/l] [select]
Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto?? by mscharrer (Hermit) on Apr 30, 2008 at 21:16 UTC
Are 12 hex-digits enough for 50 Million unique strings? Let's ask perl: `perl -e 'print ((16*12 >= 50106) ? "Yes" : "No")'` Answer: `Yes` You get max 1612 = 281474976710656 different values with 12 hex-digits which is far more then 50000000. Cutting down SHA1 or MD5 to 12-digits will increase the possibility of collisions a lot. Can you not try to use a longer string or use binary?	[reply] [d/l] [select]
Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto?? by quester (Vicar) on May 01, 2008 at 07:34 UTC
This is a variant of the "birthday problem." There are many mathematical discussions of the problem, but the basic idea and equations can be found at http://en.wikipedia.org/wiki/Birthday_attack. For 25,000 records and the 32-bit address space that you ran into trouble with, it's around seven percent. With a 48-bit address space, the probability that you will find one collision is tap - tap - tap about one in nine hundred thousand. That might still be cutting it a little close, if your data set grows over time. Update: Sorry, I missed the question about fifty million records. With a 48-bit address space, that gives almost a 99% probability of at least one collision(!) The table in the article shows that 64 bits (sixteen hex digits) gives a one in a million chance of a collision for about sixty one million records.	[reply]
Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto?? by carol (Beadle) on Apr 30, 2008 at 21:17 UTC
previously my boss gave me 8-digits, and when I used Digest::MD5, I got duplicated IDs If you got duplicates in the output, you can be sure the list you used as input had duplicates also. Hashes like MD5 are designed to not produce collisions (different inputs producing same output). Either that, or some programming error in your script using Digest::MD5.	[reply]
Re^4: Question: Generate unique/random 12-digit keys for 25,000K records, howto?? by kyle (Abbot) on Apr 30, 2008 at 21:29 UTC
That's actually not that far fetched. `use Digest::MD5 'md5_hex'; my $x = 'a'; my %found; my $key; while (1) { $key = substr md5_hex($x), 0, 8; if ( exists $found{$key} ) { my $first_md5 = md5_hex( $found{$key} ); my $second_md5 = md5_hex( $x ); die "found $key at $x ($second_md5) and $found{$key} ($first_m +d5)\n"; } else { $found{$key} = $x; } $x++; } __END__ found a986d9ee at bwma (a986d9ee140c5acbf0d51c00bc5a7810) and kot (a98 +6d9ee785f7b5fdd68bb5b86ee70e0)` [download] With only eight hex digits, there's only `4_294_967_296` different hashes. You can exhaust that pretty quck.	[reply] [d/l] [select]
Re^4: Question: Generate unique/random 12-digit keys for 25,000K records, howto?? by shmem (Chancellor) on Apr 30, 2008 at 21:36 UTC
An md5 hash of a large dataset is always a reduction of information. Different sets of information can result in the same reduction, which is actually the case with md5. There's a paper about md5 collisions. --shmem _($_=" "x(1<<5)."?\n".q·/)Oo. G°\ / /\_¯/(q / ---------------------------- \__(m.====·.(_("always off the crowd"))."· ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}	[reply]
Re^4: Question: Generate unique/random 12-digit keys for 25,000K records, howto?? by mscharrer (Hermit) on Apr 30, 2008 at 21:22 UTC
He was using MD5 and then only used the 8 first (or last) characters. This can lead to collisions. (Also in theory two MD5s from two different inputs can be identical, but this is very unlikely)	[reply]
Re^5: Question: Generate unique/random 12-digit keys for 25,000K records, howto?? by carol (Beadle) on Apr 30, 2008 at 22:04 UTC
You are right. I didn't realise he would take a substring from the result.	[reply]


Problems? Is your data what you think it is?
	PerlMonks