nmerriweather has asked for the wisdom of the Perl Monks concerning the following question:

I'm facing a similar directory scheme issue as this node Need directory scheme to store 400,000 images

i came across many of the same questions, answers, and realizations in my quest (most notably that dispersion requires reverse name mapping for numerical ids, frontwards for digests).

i've decided against a serial numeric value for the files, and to use a unique md5 digest for the file id

that digest will then be mapped to a series of 2char directories abcdefghij -> ab/cd/ef/abcdefghij

here's where my issue comes in: md5 offers 3 ways to represent the digest -- binary, b16 and b64

binary isn't applicable for directory mapping.
b16 will give me directories with 256 subdirs
b64 will create 4096 subdirs

a b32 representation would, however, create 1024 subdirs -- right around the magic number where many filesystems have issues with the number of files in them

That said, it seems obvious that I would want a base32 representation of the digest -- except that I have no idea how to create this representation. can anyone offer suggestions?

Replies are listed 'Best First'.
Re: Making a base32 representation of md5
by perlfan (Parson) on Mar 17, 2005 at 20:10 UTC
    Well, MD5 with the letters is hex (base 16), so converting them should be easy, but you will have to use the symbols 0-9 and A-V where:
    0-9 are the same, and: A - 11 B - 12 ... V - 31
    I can't, off the top of my head tell you the proceedure, but you could lookup converting to hexadecimal, and go from there.

    It would be just as easily to convert from the 128 bit string (1's and 0's) to base32 as well.
Re: Making a base32 representation of md5
by TedPride (Priest) on Mar 17, 2005 at 20:25 UTC
    Pardon me for asking stupid questions, but what's wrong with 16-bit? Even with only two layers of subdirectories, we're talking 65536 different slots to put things in, meaning an average of only about 6.1 items per slot. The number of images could increase by a couple factors of magnitude and still probably do ok with that many slots.

    Just use 16-bit. The simplest solution is usually the best.

Re: Making a base32 representation of md5
by eXile (Priest) on Mar 18, 2005 at 03:26 UTC
    Why not use Cache::Cache to do the hard work for you? This will create a file cache (or a memory cache if that is what you want), of several directories deep (I even think it's configurable how deep you want it to be).
Re: Making a base32 representation of md5
by BrowserUk (Patriarch) on Mar 18, 2005 at 04:17 UTC

    I've mapped base 32 to 'A'..'Z', 0 .. 5, but you could use any set of 32 symbols that you like:

    use Digest::MD5 qw[ md5 ]; sub md5_base32{ join'',map{ ('A'..'Z', 0..5 )[ ord pack 'b5', $_ ] } unpack '(A5)*', unpack'b*', md5( $_[ 0 ] ) } print md5_base32( 'fred' ); XSCAZ54XMU5W0OVYUWYRGZU0RF

    Note that the last digit will always be 0..5/'A'..'F', but that's probably okay for your purpose.

    Update: A slightly more efficient version:

    sub md5_base32_2{ join'', ('A'..'Z', 0..5 )[ unpack 'C*', pack '(b5)*', unpack '(A5)*', unpack 'b*', md5 $_[ 0 ] ] }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco.
    Rule 1 has a caveat! -- Who broke the cabal?
Re: Making a base32 representation of md5
by holli (Abbot) on Mar 17, 2005 at 20:31 UTC
    I have an application at work where I need to store unique files in a single directory. In it I am converting the file's content into a MD5 checksum and save it under that name. I simply use the md5() function and convert every byte to its ordinal value. Basically works like this:
    use strict; use Digest::MD5 qw(md5); my $d = md5("somerandomdata"); my $f = join "", map sprintf ("%03d", ord $_), split "", $d; print $f;
    That prints "086152237030061101168084252112035147251032073180" (3 bytes for every byte in the checksum.) and you can safely use that as a filename.


    holli, /regexed monk/

      Why? Taking the ordinate doesn't make the file name any more unique than not taking it. If you want safety, you would be better off using Data::UUID, which, as the docs say, will generate a UUID that "…is 128 bits long, and is guaranteed to be different from all other UUIDs/GUIDs generated until 3400 CE."

      Granted, that's limited to your domain, but I still doubt it will be a serious problem. And, by the time it causes issues, you will be dead. ;-) MD5 sums can collide, as any hash algorithm can -- it's just very hard to deliberately construct two messages with the same signature that could possibly be mistaken for each other.

      MD5 is not for establishing uniqueness, it's for signing data to validate that it has not changed since its first signing.

      Anima Legato
      .oO all things connect through the motion of the mind

        The problem with the guaranteed version of Data::UUID is that you can't recreate the same UUID a second time. Which means that if you want the same file, you can't just create the Data::UUID to find out what directory it's in - you need to scan them. What is wanted here is a hashing algorithm - put in some piece of data (possibly including characters that cannot be represented on the filesystem), get a directory to store it in, and then be able to retrieve it when you pass in the same piece of data.

        I actually have an implementation of this that is ready to go on CPAN ... as soon as my manager allows me to do so.

        It's because I store "unique" files and the best way to ensure and quickly check that (without a db or additional db-file) is to simply save them with the checksum as the name.

        As for the collission, I tought about that before. I think I'll add another checksum algo, SHA, to the name.
        Using two independent algorithms should save me from any collission. Then it's more likely the whole building tunnels into another universe spontanously.


        holli, /regexed monk/
[OT] Re: Making a base32 representation of md5
by jdalbec (Deacon) on Mar 18, 2005 at 04:52 UTC
    Another possibility (unrelated to Perl) would be to use a filesystem that doesn't have issues with large numbers of files in a directory.
Re: Making a base32 representation of md5
by ikegami (Patriarch) on Mar 17, 2005 at 20:25 UTC

    In other words, you have a 128 bit number, and you want to represent it in base32.

    my $binary_digest = ...; my @digits = (0..9, 'A'..'V'); my $base32_digest = ''; for (0..3) { # 2^32^4 = 2^128 $base32_digest = $digits[$binary_digest & 31]. $base32_digest; $binary_digest >>= 5; } print("$base32_digest\n");

    Update: Strike that. You don't start with a 128 bit number, it wouldn't fit. I'd reengineer my solution, but I must leave. Good luck.

Re: Making a base32 representation of md5
by nmerriweather (Friar) on Mar 18, 2005 at 17:10 UTC
    thanks for all the suggestions!!!

    in regards to some comments:

    b32 v b16 -- a b32 encoding would give me exactly what i wanted, a 16 is entirely doable giving me more than i need with a third level, but a 32 would give me exactly what i want in 2 levels of the structure. now i get to debate if its worth it or not through benchmarking.

    fs choice -- i want this to be portable. i dont want to be in the situation where performance is degraded because i have to install this application on a specific filesystem for whatever reason -- and with reiserfs i'd be limited to linux and wouldnt be able to use bsd (although i think it might be supported under netbsd now)