I put this in the code archives as
Directory Hashing Algorithm. Reprinted
here for convenience.
The problem with directory "hashing" is you have to do it
right or you're not solving any problems. Using substr()
or XOR-math seems to create a pseudo-random distribution,
but similar strings often end up getting filed together.
That is, if someone uploads picture0001.jpg through
picture9999.jpg, your "hash" might not distribute them
very well, and performance will suffer.
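To make the clustering concrete, here is a minimal sketch of a naive substr()-based scheme. The naive_bucket name and the choice of the first two characters are assumptions made for illustration, not something taken from a real application.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical naive scheme: use the first two characters of the
# filename as the directory name.
sub naive_bucket {
    my ($name) = @_;
    return substr($name, 0, 2);
}

# Count how many files land in each directory for a run of
# sequentially named uploads.
my %count;
$count{ naive_bucket(sprintf 'picture%04d.jpg', $_) }++ for 1 .. 9999;

printf "%d directories for 9999 files\n", scalar keys %count;
# prints "1 directories for 9999 files"
```

Every name starts with "pi", so all 9999 files are filed together, which is exactly the case a directory hash is supposed to prevent.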
Since the MD5 hash is a fine piece of work, and I'm not
in any position to achieve better, I used the "secret"
MD5 mode of the crypt function, which a salt beginning
with '$1$' selects on systems whose crypt supports it.
Despite the fact that I'm using the same "salt" every time,
the output is quite random, and a change as small as one
bit can create big waves, as any good hash should.
QMail uses a similar technique to store e-mail messages:
there can be thousands of these on any given server,
and access time is paramount.
#!/usr/bin/perl -w
use strict;

# hashed - Create a hashed directory path for a given filename
# using an MD5-based hash generated by the crypt() function.
sub hashed
{
    my ($name) = @_;

    # Send the input into the grinder until it comes out the
    # right size. It shrinks blocks of up to 12 characters
    # into only two with each pass.
    while (length($name) > 4)
    {
        my $crypt = '';
        foreach ($name =~ /.{1,12}/gs)
        {
            # The '$1$' salt prefix selects MD5-based crypt where the
            # platform supports it. The first 12 characters of the
            # result are the salt, so offset 12 starts the hash itself.
            $crypt .= substr(crypt($_,'$1$ABCDEFGH'),12,2);
        }
        $name = $crypt;
    }

    # Fix unruly characters, as crypt will gleefully return
    # '/' in the hashed strings. These are converted to 'Z'
    $name =~ y!/!Z!;

    # Split the returned string into a full path, including
    # the specified filename.
    return join ('/', ($name =~ /../g), $_[0]);
}

print hashed ("thisimage.gif"),"\n";
print hashed ("thisimage.jpg"),"\n";
print hashed ("thisimage1.gif"),"\n";
print hashed ("thisimage2.gif"),"\n";
The output, for the curious, is:
0u/uh/thisimage.gif
VO/Bz/thisimage.jpg
68/g1/thisimage1.gif
dT/g1/thisimage2.gif
If you over-hash, you end up with far too many directories
and not enough files. Feel free to tweak this to make it
better fit your application.
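As one possible starting point for that tweaking, here is a rough sketch with the number of directory levels as a parameter. It is not the code above: hashed_path is a hypothetical name, and it swaps crypt() for the core Digest::MD5 module, so it produces different paths from hashed(). It assumes filenames are plain byte strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# hashed_path - hypothetical variant of hashed() that takes the
# number of directory levels as a parameter. It uses Digest::MD5
# instead of crypt(), so its paths differ from those of hashed().
sub hashed_path {
    my ($name, $levels) = @_;
    $levels = 2 unless defined $levels;

    # md5_hex returns 32 hex digits, so at most 16 two-digit levels.
    die "levels must be between 0 and 16\n"
        if $levels < 0 || $levels > 16;

    # Take one pair of hex digits from the digest per directory level.
    my $digest = md5_hex($name);
    my @dirs = map { substr($digest, 2 * $_, 2) } 0 .. $levels - 1;

    return join '/', @dirs, $name;
}

print hashed_path('thisimage.gif'), "\n";     # two levels (default)
print hashed_path('thisimage.gif', 1), "\n";  # one level
```

Each level in this sketch has at most 256 directories, so one level is probably plenty for a few thousand files, and every extra level multiplies the directory count by 256.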