in reply to Making a base32 representation of md5

I have an application at work where I need to store unique files in a single directory. It computes the MD5 checksum of each file's content and saves the file under that name. I simply use the md5() function and convert every byte of the digest to its ordinal value. It basically works like this:
use strict;
use Digest::MD5 qw(md5);
my $d = md5("somerandomdata");
my $f = join "", map sprintf ("%03d", ord $_), split "", $d;
print $f;
That prints "086152237030061101168084252112035147251032073180" (three decimal digits for every byte in the checksum), and you can safely use that as a filename.
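
For illustration, here is a minimal sketch of how the naming scheme could double as a duplicate check. The helper names digest_name and store_unique are made up for this example and are not from my actual application; it also writes into a temporary directory rather than a real store.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: three decimal digits per digest byte, 48 digits total.
sub digest_name {
    my ($data) = @_;
    return join "", map { sprintf "%03d", $_ } unpack "C*", md5($data);
}

# Hypothetical helper: write $data into $dir under its digest name,
# unless a file with that name is already there.
# Returns the path and a flag telling whether a new file was written.
sub store_unique {
    my ($dir, $data) = @_;
    my $path = File::Spec->catfile($dir, digest_name($data));
    return ($path, 0) if -e $path;    # same checksum already stored

    open my $fh, '>:raw', $path or die "Cannot write $path: $!";
    print {$fh} $data;
    close $fh or die "Cannot close $path: $!";
    return ($path, 1);
}

# Assumption: a throwaway directory stands in for the real storage directory.
my $dir = tempdir(CLEANUP => 1);
my ($first,  $new1) = store_unique($dir, "somerandomdata");
my ($second, $new2) = store_unique($dir, "somerandomdata");
print "stored: $new1, stored again: $new2\n";
```

The second call finds the existing file by name alone, so no database or index file is needed. Note that the -e test followed by open is not atomic, so concurrent writers would need extra care.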


holli, /regexed monk/

Re^2: Making a base32 representation of md5
by legato (Monk) on Mar 17, 2005 at 22:15 UTC

    Why? Taking the ordinal values doesn't make the file name any more unique than not taking them. If you want safety, you would be better off using Data::UUID, which, as the docs say, will generate a UUID that "…is 128 bits long, and is guaranteed to be different from all other UUIDs/GUIDs generated until 3400 CE."

    Granted, that's limited to your domain, but I still doubt it will be a serious problem. And, by the time it causes issues, you will be dead. ;-) MD5 sums can collide, as any hash algorithm can -- it's just very hard to deliberately construct two messages with the same signature that could possibly be mistaken for each other.

    MD5 is not for establishing uniqueness; it's for signing data to validate that it has not changed since its first signing.

    Anima Legato
    .oO all things connect through the motion of the mind

      The problem with the guaranteed version of Data::UUID is that you can't recreate the same UUID a second time. Which means that if you want the same file, you can't just create the Data::UUID to find out what directory it's in - you need to scan them. What is wanted here is a hashing algorithm - put in some piece of data (possibly including characters that cannot be represented on the filesystem), get a directory to store it in, and then be able to retrieve it when you pass in the same piece of data.
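
To make the idea concrete, here is a rough sketch of such a repeatable key-to-path mapping. It is not the unreleased module mentioned in this thread; the function name path_for_key and the two-character subdirectory layout are assumptions chosen purely for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;

# Hypothetical helper: map any byte string to a filesystem-safe relative path.
# The first two hex characters become a subdirectory, the remaining 30 the file name.
sub path_for_key {
    my ($key) = @_;
    my $hex = md5_hex($key);    # 32 lowercase hex characters
    return File::Spec->catfile(substr($hex, 0, 2), substr($hex, 2));
}

# The key may contain characters a filesystem would reject; the path never does.
my $key  = "report: Q1/2005 *draft*?";
my $path = path_for_key($key);
print "$path\n";
```

Calling path_for_key again with the same key gives the same path, which is exactly what a freshly generated UUID cannot offer.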

      I actually have an implementation of this that is ready to go on CPAN ... as soon as my manager allows me to do so.

      It's because I store "unique" files and the best way to ensure and quickly check that (without a db or additional db-file) is to simply save them with the checksum as the name.

      As for the collision, I thought about that before. I think I'll add a second checksum algorithm, SHA, to the name.
      Using two independent algorithms should make an accidental collision practically impossible. Then it's more likely the whole building tunnels into another universe spontaneously.
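
A rough sketch of what such a combined name could look like. It assumes SHA-1 from the Digest::SHA module, since no SHA variant has been picked here, and it uses hex digests with a "-" separator; the function name combined_name is invented for the example.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Digest::SHA qw(sha1_hex);    # assumption: SHA-1; the variant is not settled

# Hypothetical helper: file name built from both digests of the same data.
sub combined_name {
    my ($data) = @_;
    return md5_hex($data) . "-" . sha1_hex($data);
}

my $name = combined_name("somerandomdata");
print length($name), "\n";    # 32 + 1 + 40 = 73 characters
```

Two files would then have to collide under both algorithms at once to end up with the same name.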


      holli, /regexed monk/