in reply to Re^2: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
in thread Question: Generate unique/random 12-digit keys for 25,000K records, howto??

previously my boss gave me 8-digits, and when I used Digest::MD5, I got duplicated IDs
If you got duplicates in the output, you can be sure the list you used as input also had duplicates. Hashes like MD5 are designed not to produce collisions (different inputs producing the same output).
Either that, or some programming error in your script using Digest::MD5.
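One way to separate the two cases is to check the input for repeats before hashing anything. The following is only a minimal sketch: the `find_duplicates` helper and the sample `@records` list are hypothetical, and the original poster's data is assumed to be a flat list of strings.

```perl
use strict;
use warnings;

# Return each value that occurs more than once in the list,
# reported once, in the order in which its first repeat is seen.
sub find_duplicates {
    my @items = @_;
    my %seen;
    my @dupes;
    for my $item (@items) {
        # Post-increment yields the previous count, so this is true
        # exactly on the second occurrence of $item.
        push @dupes, $item if $seen{$item}++ == 1;
    }
    return @dupes;
}

my @records = qw(alice bob carol bob dave alice bob);    # hypothetical input
my @dupes   = find_duplicates(@records);
print "duplicate inputs: @dupes\n";
```

If this reports nothing and the generated keys still repeat, the duplicates come from the key generation step rather than from the input.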

Re^4: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by kyle (Abbot) on Apr 30, 2008 at 21:29 UTC

    That's actually not that far fetched.

    use Digest::MD5 'md5_hex';

    my $x = 'a';
    my %found;
    my $key;

    while (1) {
        # Keep only the first 8 hex digits of the digest.
        $key = substr md5_hex($x), 0, 8;
        if ( exists $found{$key} ) {
            my $first_md5  = md5_hex( $found{$key} );
            my $second_md5 = md5_hex( $x );
            die "found $key at $x ($second_md5) and $found{$key} ($first_md5)\n";
        }
        else {
            $found{$key} = $x;
        }
        $x++;
    }

    __END__
    found a986d9ee at bwma (a986d9ee140c5acbf0d51c00bc5a7810) and kot (a986d9ee785f7b5fdd68bb5b86ee70e0)

    With only eight hex digits, there are only 4_294_967_296 different hashes. You can hit a repeat pretty quickly, long before you exhaust them.
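How quickly can be estimated with the usual birthday approximation. The sketch below assumes the truncated digests behave like uniformly random draws from the 16**8 possible keys. The `collision_probability` helper is a hypothetical name, not part of Digest::MD5.

```perl
use strict;
use warnings;

# Approximate probability of at least one collision when drawing $n values
# uniformly at random from $space possible values, using the birthday
# approximation 1 - exp( -n(n-1) / (2 * space) ).
sub collision_probability {
    my ($n, $space) = @_;
    return 1 - exp( -$n * ($n - 1) / (2 * $space) );
}

my $space = 16 ** 8;    # 8 hex digits => 4_294_967_296 possible keys

for my $n (10_000, 50_000, 77_163, 1_000_000, 25_000_000) {
    printf "%10d inputs: P(collision) ~ %.4f\n",
        $n, collision_probability($n, $space);
}
```

Under that assumption the odds reach roughly even at about 77,000 inputs, and a collision is all but certain at the 25 million records mentioned in the thread title.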

Re^4: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by shmem (Chancellor) on Apr 30, 2008 at 21:36 UTC

    An MD5 hash of a large dataset is always a reduction of information. Different sets of information can result in the same reduction, and such collisions have actually been found for MD5. There's a paper about MD5 collisions.
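The reduction is easy to see, because the digest has the same length whatever the input size. This is a small illustration only, and the two input strings are arbitrary examples.

```perl
use strict;
use warnings;
use Digest::MD5 'md5_hex';

my $short = 'a';
my $long  = 'a' x 1_000_000;    # one million bytes

# md5_hex always returns 32 hex digits (128 bits), so far more distinct
# inputs exist than distinct digests.
for my $input ($short, $long) {
    printf "%7d bytes in -> %d hex digits out\n",
        length($input), length( md5_hex($input) );
}
```

Since there are more possible inputs than 128-bit outputs, some distinct inputs must share a digest. Truncating the digest to 8 hex digits shrinks the output space much further.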

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re^4: Question: Generate unique/random 12-digit keys for 25,000K records, howto??
by mscharrer (Hermit) on Apr 30, 2008 at 21:22 UTC
    He was using MD5 and then kept only the first (or last) 8 characters. This can lead to collisions.

    (Also, in theory, two MD5s from two different inputs can be identical, but this is very unlikely.)

      You are right. I didn't realise he would take a substring from the result.