in reply to Re: Question: Generate unique/random 12-digit keys for 25,000K records, howto?? in thread Question: Generate unique/random 12-digit keys for 25,000K records, howto??
thanks, can this guarentee the uniqueness, previously my boss gave me 8-digits, and when I used Digest::MD5, I got duplicated IDs. 12-digit sounds OK, at least I didn't get duplicated records by testing on 780K sub-records with Digest::SHA1. Is 12-digit enough for 50 Million unique strings ? many thanks
lihao
Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by kyle (Abbot) on Apr 30, 2008 at 21:09 UTC
|
A 12 digit hex string can have 281_474_976_710_656 different values, so it's certainly capable of representing 50_000_000 different strings. That said, Digest::SHA1 does not guarantee that two different strings will result in different hashes. Quite the opposite. If all 50M inputs are all different outputs, there's a certain degree of luck involved. (I'm too lazy to compute for you the probability of a collision, but it looks pretty small.)
If you want the output to be exactly as unique as the input, you'll have to come up with some kind of encoding. That is, it will be possible to go from the encoded version back to the original string.
The problem is that your inputs have a significantly larger alphabet than the 16 hex digits, and they're longer than the 12 digits you want to store. (I'm guessing a little here because you haven't told us anything about the format of the inputs.) As such, I think it's safe to say there's no way to encode every possible input string into the output format you want. There just isn't enough "space" to do it in.
That leaves two options, in my opinion.
- Use a hashing algorithm and rely on luck.
- Enumerate the inputs with a serial number until you run out of them (i.e., let the database do it).
| [reply] [d/l] [select] |
Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by mscharrer (Hermit) on Apr 30, 2008 at 21:16 UTC
|
Are 12 hex-digits enough for 50 Million unique strings?
Let's ask perl:
perl -e 'print ((16**12 >= 50*10**6) ? "Yes" : "No")'
Answer:
Yes
You get max 16**12 = 281474976710656 different values with 12 hex-digits which is far more then 50000000.
Cutting down SHA1 or MD5 to 12-digits will increase the possibility of collisions a lot. Can you not try to use a longer string or use binary?
| [reply] [d/l] [select] |
Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by quester (Vicar) on May 01, 2008 at 07:34 UTC
|
This is a variant of the "birthday problem." There are many mathematical discussions of the problem, but the basic idea and equations can be found at http://en.wikipedia.org/wiki/Birthday_attack. For 25,000 records and the 32-bit address space that you ran into trouble with, it's around seven percent. With a 48-bit address space, the probability that you will find one collision is tap - tap - tap about one in nine hundred thousand. That might still be cutting it a little close, if your data set grows over time.
Update: Sorry, I missed the question about fifty million records. With a 48-bit address space, that gives almost a 99% probability of at least one collision(!) The table in the article shows that 64 bits (sixteen hex digits) gives a one in a million chance of a collision for about sixty one million records. | [reply] |
Re^3: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by carol (Beadle) on Apr 30, 2008 at 21:17 UTC
|
previously my boss gave me 8-digits, and when I used Digest::MD5, I got duplicated IDs
If you got duplicates in the output, you can be sure the list you used as input had duplicates also. Hashes like MD5 are designed to not produce collisions (different inputs producing same output).
Either that, or some programming error in your script using Digest::MD5.
| [reply] |
|
use Digest::MD5 'md5_hex';
my $x = 'a';
my %found;
my $key;
while (1) {
$key = substr md5_hex($x), 0, 8;
if ( exists $found{$key} ) {
my $first_md5 = md5_hex( $found{$key} );
my $second_md5 = md5_hex( $x );
die "found $key at $x ($second_md5) and $found{$key} ($first_m
+d5)\n";
}
else {
$found{$key} = $x;
}
$x++;
}
__END__
found a986d9ee at bwma (a986d9ee140c5acbf0d51c00bc5a7810) and kot (a98
+6d9ee785f7b5fdd68bb5b86ee70e0)
With only eight hex digits, there's only 4_294_967_296 different hashes. You can exhaust that pretty quck. | [reply] [d/l] [select] |
|
| [reply] |
|
| [reply] |
|
You are right. I didn't realise he would take a substring from the result.
| [reply] |
|
|