PerlMonks  

Re: Automatically distributing and finding files in subdirectories

by Fletch (Bishop)
on Jul 18, 2006 at 02:40 UTC ( [id://561911]=note )


in reply to Automatically distributing and finding files in subdirectories

When we hit a similar problem at work, I implemented a module that took a base directory name and created two layers of 00..ff subdirectories (so 00/00, 00/01, ..., ff/fe, ff/ff). It then had a function that would take a "path" and return a real OS path for that file, generated by running the filename through Digest::MD5 and splitting off the first two pairs of hex digits to use as the subdirectory names. So given the base directory /tmp/hashed, the file fooble would be located at /tmp/hashed/03/63/03638a39d7858a61a982a1f21b33c215.
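A minimal sketch of that mapping function, for illustration only. The name hashed_path and its argument order are my own assumptions, not the interface of the module described above.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec;

# Hypothetical helper: map a logical "path" to a real OS path under $base.
# Note: md5_hex() expects bytes, so encode names containing wide characters first.
sub hashed_path {
    my ($base, $name) = @_;
    my $digest = md5_hex($name);              # 32 lowercase hex characters
    my ($d1, $d2) = $digest =~ /^(..)(..)/;   # first two pairs of hex digits
    return File::Spec->catfile($base, $d1, $d2, $digest);
}

print hashed_path('/tmp/hashed', 'fooble'), "\n";
```

The same name always maps to the same location, so nothing beyond the base directory has to be stored to find a file again.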

All of your calls to open should pass through the hashing function, or you can write a hashed_open that you call instead. With two layers there are 256 x 256 = 65,536 leaf directories, so each one should hold only about 1/64k of the files. If that still leaves too many files in any one subdirectory, you can add another layer of subdirs.
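One way a hashed_open wrapper might look, as a sketch under assumptions: the argument order (mode, base, name) is invented here, and the subdirectories are created on demand with make_path rather than pre-created as in the module described above. The base directory is a File::Temp scratch directory so the example cleans up after itself.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Basename qw(dirname);
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: logical name -> real path under $base.
sub hashed_path {
    my ($base, $name) = @_;
    my $digest = md5_hex($name);
    my ($d1, $d2) = $digest =~ /^(..)(..)/;
    return File::Spec->catfile($base, $d1, $d2, $digest);
}

# Hypothetical wrapper around three-argument open.
sub hashed_open {
    my ($mode, $base, $name) = @_;
    my $real = hashed_path($base, $name);
    # Assumption: create the two subdirectory levels lazily, only for write modes.
    make_path(dirname($real)) if $mode =~ /^\+?>/;
    open(my $fh, $mode, $real)
        or die "Cannot open '$real' (for '$name') with mode '$mode': $!";
    return $fh;
}

my $base = tempdir(CLEANUP => 1);    # scratch base directory for the demo

my $out = hashed_open('>', $base, 'fooble');
print {$out} "hello\n";
close($out) or die "close failed: $!";

my $in = hashed_open('<', $base, 'fooble');
my $line = <$in>;
close($in);
print $line;
```

Adding a third layer would only mean capturing one more pair of hex digits in hashed_path; callers of hashed_open would not change.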

Replies are listed 'Best First'.
Re^2: Automatically distributing and finding files in subdirectories
by graff (Chancellor) on Jul 19, 2006 at 02:02 UTC
    If I understand your process correctly, wouldn't there be at least the slightest little worry that two different (original) file names would generate the same MD5 hash?

    I suppose that if you just make a list of the file names and their md5 sigs first, you could spot collisions before actually moving stuff into the new directory structure. But if you have to add files to the structure over time, you need to check for the existence of a given md5 "path/name" before storing a new file there (and then figure out a proper way to avoid collisions while maintaining correct mappings between original and hashed names).

      Correct. If that 1 in 2^128 possibility bothers you, there's always Digest::SHA1 for 1 in 2^160. Or keep a DBM mapping of "path" to hash and check for collisions when adding a new "path" entry.

      Update: Left out the chance of collision for SHA1.
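      A rough sketch of the DBM idea, with several assumptions of mine: the helper name register_name is hypothetical, SDBM_File is used only because it ships with Perl, and the map is keyed by digest (digest to "path") so that the collision check is a single lookup.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;

# Hypothetical helper: record $name in the hash referenced by $map, keyed by digest.
# Returns the digest, or dies if a *different* name already owns that digest.
sub register_name {
    my ($map, $name) = @_;
    my $digest = md5_hex($name);
    if (exists $map->{$digest} && $map->{$digest} ne $name) {
        die "Digest collision: '$name' vs '$map->{$digest}' ($digest)\n";
    }
    $map->{$digest} = $name;
    return $digest;
}

my $dir = tempdir(CLEANUP => 1);     # scratch location for the demo DBM files
tie(my %map, 'SDBM_File', "$dir/names", O_RDWR | O_CREAT, 0666)
    or die "Cannot tie DBM file: $!";

my $digest = register_name(\%map, 'fooble');    # first registration
register_name(\%map, 'fooble');                 # same name again: not a collision
print "fooble is stored under $digest\n";

untie %map;
```

      SDBM limits the size of each key/value pair, so for long names another DBM backend would be the safer choice.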
