PerlMonks  

Need directory scheme to store 400,000 images

by markjugg (Curate)
on Apr 12, 2004 at 15:46 UTC

markjugg has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm building an image upload system that is being designed to store 400,000 images.

I'm seeking advice and prior art for how to store a huge number of files on the file system. I'm seeking alternatives to dumping them all in a single directory. :)

All the file names will contain a unique ID corresponding to a database row. So, file names may look like:

123.jpg

or

123456.gif

What I've thought of so far is to store files like this:

0/00/000/123.jpg
1/12/123/123456.jpg

If I've done the math right, this means each final directory will house ~1000 images, assuming fewer than 1,000,000 images in the system. I imagine there is a better design, and perhaps an existing module that addresses this. However, I don't even know what terms I would use to search for this. All help appreciated!
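A minimal sketch of the scheme above (an assumption based on the two examples: zero-pad the numeric ID to six digits, then use its 1-, 2- and 3-character prefixes as nested directory names; id_to_path is an invented name):

```perl
use strict;
use warnings;

# Zero-pad the numeric ID to six digits, then take its 1-, 2- and
# 3-character prefixes as nested directory names.
sub id_to_path {
    my ($file) = @_;                       # e.g. "123.jpg"
    my ($id) = $file =~ /^(\d+)\./ or die "Bad filename: $file";
    my $padded = sprintf "%06d", $id;      # "000123"
    return join '/', (map { substr $padded, 0, $_ } 1 .. 3), $file;
}

print id_to_path("123.jpg"), "\n";      # 0/00/000/123.jpg
print id_to_path("123456.gif"), "\n";   # 1/12/123/123456.gif
```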

Replies are listed 'Best First'.
•Re: Need directory scheme to store 400,000 images
by merlyn (Sage) on Apr 12, 2004 at 16:14 UTC
    You'll probably want to md5 the name to more evenly distribute the prefix letters, then use one or two of the hex characters at each level until the final directories are small enough.
    use Digest::MD5 qw(md5_hex);
    my $name = "123456.jpg";
    my $path = md5_hex($name);
    $path =~ s#^(.)(.)(.).*#$1/$2/$3/$name#;
    Three hex characters give 16³ = 4,096 leaf directories, distributing 400K files at roughly 100 files per directory. Cache::FileCache uses a similar scheme.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      We decided to use md5 over the other options. Some reasons included:
      • md5 creates a balanced tree. With the numerical scheme, directories would vary from having 1 to 10,000 files in them, assuming three levels of directories in the form "1/2/3".
      • IDs of fewer than 3 characters get treated the same way. (With the numerical version, files either ended up in higher-level directories or got padded with zeros.)
      • It's still not very hard to find a directory "by hand" if a programmer needs to. Running md5 on the name at the command line quickly reproduces the result. This would work as well:
        find ./uploads_dir -name '1234.jpg'

        I also found out we need to plan for more like 1.5 million images.

Re: Need directory scheme to store 400,000 images
by kvale (Monsignor) on Apr 12, 2004 at 16:09 UTC
    Your directory scheme is reasonable; this is similar to what CPAN uses for storing files in the authors' directory. You might check the CPAN.pm module in your installation for details on how to use this sort of scheme.

    One difference is that it takes the prefix of the file name, allowing the suffix to be of arbitrary length:

    123.jpg    -> /1/12/123.jpg
    114884.jpg -> /1/11/114884.jpg

    This has the advantage of (1) having a uniform algorithm over all files to generate the file path and (2) being expandable in your case beyond 1_000_000 files.
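    A quick sketch of this prefix scheme (prefix_path is an invented name, not taken from CPAN.pm):

```perl
use strict;
use warnings;

# Use the 1- and 2-character prefixes of the file name itself,
# so the total length of the name doesn't matter.
sub prefix_path {
    my ($file) = @_;
    return join '/', '', substr($file, 0, 1), substr($file, 0, 2), $file;
}

print prefix_path("123.jpg"), "\n";      # /1/12/123.jpg
print prefix_path("114884.jpg"), "\n";   # /1/11/114884.jpg
```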

    -Mark

      The digits in the end of the number are more likely to be uniformly distributed:

      123.jpg    -> /3/23/123.jpg
      114884.jpg -> /4/84/114884.jpg
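      The same idea using trailing digits, as a sketch (suffix_path is an invented name):

```perl
use strict;
use warnings;

# Take prefixes from the *end* of the numeric ID, where digits tend
# to be closer to uniformly distributed than at the front.
sub suffix_path {
    my ($file) = @_;
    my ($id) = $file =~ /^(\d+)\./ or die "Bad filename: $file";
    return join '/', '', substr($id, -1), substr($id, -2), $file;
}

print suffix_path("123.jpg"), "\n";      # /3/23/123.jpg
print suffix_path("114884.jpg"), "\n";   # /4/84/114884.jpg
```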
Re: Need directory scheme to store 400,000 images
by liz (Monsignor) on Apr 12, 2004 at 16:31 UTC
    If you want to do it really simple, use a ReiserFS partition. From the front page blurb:

    Do you want a million files in a directory, and want to create them fast? No problem.

    I've used ReiserFS in one situation in the past, where I copied a tree structure of about 70 GB onto a ReiserFS file system: in the end, it used only about 55 GB. Never had any trouble at all. Highly recommended!

    Liz

      I agree. Reiser does an excellent job of handling the many-files-in-one-dir problem, avoiding the need for artificial hashing schemes.

      Regarding reliability, I now use reiser3 exclusively on my 2.6.2 laptop and have done every murderous thing possible to it, including hard poweroffs during busy writes and numerous attempts to get software-suspend working (got it in the end, but did many horrible things to my system in the process; reiser took it all without blinking).

      I understand that there were problems in the past, and I would highly recommend solid testing prior to deployment on a production platform, but I really do think this solution is superior to the directory-hashing workaround if the option is available.

      Interesting to know. I've personally had bad experiences with ReiserFS, at least as provided by the Mandrake Linux 9.2 installer. On my own machine I got crippling file system errors after just a few days with it, and quit using it at that point. Another friend using ReiserFS found that it hung at boot time after a few days, needing a manual fsck. I've had fewer problems with ext3.
Re: Need directory scheme to store 400,000 images
by RMGir (Prior) on Apr 12, 2004 at 17:22 UTC
    A similar approach worked fine for me back in the old dark ages of DOS and 8.3 file names. FAT-16 dealt pretty badly with huge subdirectories (meaning over 2,000 files; I didn't have to deal with 400k files).

    There probably IS a better way. But is it worth it?

    I'd suggest you set up a simple isolation layer, so you can do

    my $fh = image_open("123456.jpg");
    Then run some tests with it; it should be easy to create a sample 400k-file subtree with something like the right distribution of file sizes, if you have the disk space.

    If it's fast enough, then don't worry about it. If you have a performance issue, then worry about making it better.

    Offhand, I suspect if you have a performance issue, it will come from the horrible impact this kind of structure will have on your disk caching, since accessing any given file is going to involve reading 4 directories then the file, and your images may not be accessed in any kind of cache-friendly order.

    If it turns out the performance bites, you can change your open routine to use some other underlying structure without affecting the rest of the code...
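    A rough sketch of such an isolation layer (image_open, image_path, and the zero-padded-prefix scheme behind them are illustrative assumptions, not code from the thread):

```perl
use strict;
use warnings;

# All callers go through image_path()/image_open(); the directory
# scheme inside image_path() can be swapped out later without
# touching any caller.
sub image_path {
    my ($file) = @_;
    my ($id) = $file =~ /^(\d+)\./ or die "Bad filename: $file";
    my $padded = sprintf "%06d", $id;
    return join '/', 'uploads',
        (map { substr $padded, 0, $_ } 1 .. 3), $file;
}

sub image_open {
    my ($file) = @_;
    open my $fh, '<:raw', image_path($file)
        or die "Can't open $file: $!";
    return $fh;
}

print image_path("123456.jpg"), "\n";   # uploads/1/12/123/123456.jpg
```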


    Mike
Re: Need directory scheme to store 400,000 images
by MidLifeXis (Monsignor) on Apr 12, 2004 at 17:30 UTC

    Hi Mark.

    Check out Re: Efficient processing of large directory, a previous node I wrote dealing with a very similar question. The solution is very implementation specific, and is based on what your filesystem's characteristics are.

    --MidLifeXis

    P.S. Is this an upgrade to your C::A image storage app?

      Yes, it's likely this code will wind up in a version of CGI::Uploader on CPAN. The most current version has a function called 'build_loc' to handle path-name generation. So it may be possible to support more than one storage scheme easily, or at least to subclass the module and provide your own if you don't like mine.

Re: Need directory scheme to store 400,000 images
by fizbin (Chaplain) on Apr 12, 2004 at 17:56 UTC
    I'll note that what I've seen before is more along these lines:
    1/2/3/123.jpg
    8/6/7/8675309.jpg
    That is, not repeating the whole prefix up to that point. You can then easily turn a simple filename into a file-and-directory path with this:
    my $fullpath = $file;   # Not just the number, but e.g. "7893.jpg"
    $fullpath =~ s|(\w{1,3}).*|join('/',split(//,$1),$&)|e
        or die "Bad filename $file";
    (Adjust the "3" for a deeper directory hierarchy as desired.) This will put a short filename, like 12.jpg, into a higher-level directory than the other files (12.jpg becomes 1/2/12.jpg); if that's not desired, consider using a filename like 00000012.jpg instead of 12.jpg.

    On a small scale (1 initial letter), this is the way the terminfo database files are stored, and is also pretty much how Debian organizes their packages on their ftp sites.

    But I'd personally go with the ReiserFS suggestion myself, assuming you have full control of the target box.

Re: Need directory scheme to store 400,000 images
by pizza_milkshake (Monk) on Apr 13, 2004 at 06:08 UTC
    i've stored 1M+ images locally (pr0n spider, thank you). i split them up into directories of 10,000 images each based on a primary key calculated by a simple int(id/1e4). doing things like df -h took FOREVER, but i didn't have any major problems. filesystem was ext3 on a standard 7200rpm ide drive.
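    That bucketing amounts to something like the following sketch (bucket_path is an invented name):

```perl
use strict;
use warnings;

# Directory = int(id / 1e4): IDs 0..9999 land in "0/",
# 10000..19999 in "1/", and so on -- 10,000 files per directory.
sub bucket_path {
    my ($id, $ext) = @_;
    return sprintf "%d/%d.%s", int($id / 1e4), $id, $ext;
}

print bucket_path(123456, "jpg"), "\n";   # 12/123456.jpg
print bucket_path(42,     "gif"), "\n";   # 0/42.gif
```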

    perl -e'$_="nwdd\x7F^n\x7Flm{{llql0}qs\x14";s/./chr(ord$&^30)/ge;print'

Re: Need directory scheme to store 400,000 images
by Paulster2 (Priest) on Apr 13, 2004 at 14:07 UTC

    While we don't store as many images as you are talking about (usually around the 20K+ mark), we have 12TB of info that we deal with on a regular basis. Our scheme for keeping track of these images is splitting across the different archives (RAID partitions), which are used as separate directories, all linked to one central directory. From there we give each separate image a unique 10-character image code and use that as the name of the sub-directory. The image is then stored within that sub-directory. All of this is tracked in an Oracle database.

    In the *nix world it would look something like this:

    /base_dir/images_archive1/<10_digit_unique_id>/image/<image_name>

    Now this is a very simplified version of how our archive is set up, but it works fairly well. I think the major player here is the database keeping everything straight. The previous solution using 10K per archive sounds like a good cutoff point to me, also. I know using too many image sub-directories will probably cause you problems.

    I know that this is not a Perl solution, but maybe it helps in some little way.

    Paulster2

Re: Need directory scheme to store 400,000 images
by bageler (Hermit) on Apr 13, 2004 at 19:59 UTC
    I've used both a hierarchical directory scheme based on a numerical ID and relational BLOBs in postgres to store large amounts of pr0n data for my last employer. With the high-performance storage hardware and frontend (web farm) caching framework I put together, the performance was excellent.
Re: Need directory scheme to store 400,000 images
by rvosa (Curate) on Apr 13, 2004 at 22:53 UTC
    Is storing them as BLOBs in a database an option?

Node Type: perlquestion [id://344409]
Approved by kvale
Front-paged by grinder