legLess has asked for the wisdom of the Perl Monks concerning the following question:

Monks ~

I'm writing a script which accepts a JPEG upload from a user and, after appropriate checks, stores it for later display. So far so good, thanks in large part to the time I've spent here.

But a little knowledge is a dangerous thing, and I know just enough programming to realize I don't know how best to store or organize these files. Here's what I know (or think I know):

Left to my own devices I'd make a hard limit on the number of files allowed in a directory and, when that limit's reached, move to a new directory and start to fill it. Lather, rinse, repeat. I'd probably use 3 characters (e.g. "aaa," "aab") for directory names. The server in question is running FreeBSD, and I'm currently trying to find a good number for "max_directory_entries."
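That rollover idea can be sketched in a few lines of Perl (the per-directory cap and the function name are hypothetical; Perl's magic string increment handles the "aaa" -> "aab" stepping for free):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical cap on files per directory; tune this to your filesystem.
my $MAX_PER_DIR = 1000;

# Given the current directory name ("aaa", "aab", ...) and how many files
# it already holds, return the directory the next upload should go in.
sub next_upload_dir {
    my ($dir, $count) = @_;
    return $dir if $count < $MAX_PER_DIR;   # still room in this one
    return ++$dir;    # magic string increment: "aaz" becomes "aba"
}

print next_upload_dir("aaa", 42), "\n";     # aaa - still has room
print next_upload_dir("aaa", 1000), "\n";   # aab - rolled over
print next_upload_dir("aaz", 1000), "\n";   # aba - increment carries
```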

From a programming class I took yon these many moons ago (not my degree, obviously), I remember using hash structures for storing and retrieving data. My sense is that this would be overkill for what I need. I'm never going to search these directories, since the database will always know exactly where every image is.

Does anyone have any insight or suggestions?

Thanks!
--
man with no legs, inc.


Replies are listed 'Best First'.
Re: Storing uploaded files for speed and efficiency
by suaveant (Parson) on Jul 31, 2001 at 19:30 UTC
    if you want to use 3 characters, I would suggest doing directories like /a/aa/aac or /g/ga/gab, which gives you only 26 directories (more if you use numbers or capitalization) per level... much easier to look at as a human :) That is a pretty common practice, and it is pretty easy to do... it should also keep you pretty well separated, unless your files have a tendency toward the same name... then you might want some kind of hashing function, like length of filename plus first and last letter, or something else like that... for randomly distributed filenames, I'd say /a/an/ant is fine... though I would only create the dirs as they are needed... that way you don't have extras like zxq sitting around eating up inodes...

                    - Ant
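The scheme ant describes can be sketched like this (the storage root and helper name are made up for illustration; directories are created only on first use, as he suggests):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(mkpath);

# Sketch of the nested scheme: "antelope.jpg" lands in a/an/ant/.
# $base is a hypothetical storage root.
sub path_for {
    my ($base, $file) = @_;
    my ($stem) = lc($file) =~ /^(\w+)/;          # name before the extension
    my @levels = map { substr($stem, 0, $_) } 1 .. 3;
    my $dir = join '/', $base, @levels;          # e.g. images/a/an/ant
    mkpath($dir) unless -d $dir;                 # create only when needed
    return "$dir/$file";
}

print path_for("images", "antelope.jpg"), "\n"; # images/a/an/ant/antelope.jpg
```

Filenames shorter than three letters just repeat the whole stem at each level (e.g. "ab.jpg" goes in a/ab/ab/), which keeps the sketch simple without losing files.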

      Thanks ant, tree structure is a good suggestion.
      --
      man with no legs, inc.
Re: Storing uploaded files for speed and efficiency
by tadman (Prior) on Jul 31, 2001 at 20:09 UTC
    I put this in the code archives as Directory Hashing Algorithm. Reprinted here for convenience.

    The problem with directory "hashing" is you have to do it right or you're not solving any problems. Using substr() or XOR-math seems to create a pseudo-random distribution, but similar strings often end up getting filed together. That is, if someone uploads picture0001.jpg through picture9999.jpg, your "hash" might not distribute them very well, and performance will suffer.

    Since the MD5-based crypt is a fine piece of work, and I'm not in any position to achieve better, I used the MD5 mode of the crypt function (selected by the "$1$" salt prefix). Despite the fact that I'm using the same salt every time, the output is quite random, and a change as small as one bit can create big waves, as any good hash should.

    qmail uses a similar technique to store e-mail messages; as you can imagine, there can be thousands of these on any given server, and access time is paramount.
    #!/usr/bin/perl -w
    use strict;

    # hashed - Create a hashed directory path for a given filename
    # using an MD5-hash generated by the crypt() function.
    sub hashed
    {
        my ($name) = @_;

        # Send the input into the grinder until it comes out the
        # right size. It shrinks blocks of up to 12 characters
        # into only two with each pass.
        while (length($name) > 4)
        {
            my $crypt;
            foreach ($name =~ /.{1,12}/gs)
            {
                $crypt .= substr(crypt($_, '$1$ABCDEFGH'), 12, 2);
            }
            $name = $crypt;
        }

        # Fix unruly characters, as crypt will gleefully return
        # '/' in the hashed strings. These are converted to 'Z'.
        $name =~ y!/!Z!;

        # Split the returned string into a full path, including
        # the specified filename.
        return join('/', ($name =~ /../g), $_[0]);
    }

    print hashed("thisimage.gif"), "\n";
    print hashed("thisimage.jpg"), "\n";
    print hashed("thisimage1.gif"), "\n";
    print hashed("thisimage2.gif"), "\n";
    The output, for the curious, is:
    0u/uh/thisimage.gif
    VO/Bz/thisimage.jpg
    68/g1/thisimage1.gif
    dT/g1/thisimage2.gif
    If you over-hash, you end up with far too many directories and not enough files. Feel free to tweak this to make it better fit your application.
Re: Storing uploaded files for speed and efficiency
by Agermain (Scribe) on Jul 31, 2001 at 19:33 UTC

    Why don't you store things by date, then, since you'll never need to search these files directly? If you need three characters, why not make them symbolic (A=January, B=February, C=March, etc.)? You could then tack on the year, giving a directory like /01G for July 2001.

    If you end up with too many files per month, you may want to switch to a new subdirectory per week or per day, e.g. /2001/jul/31/___.jpg or /01G/31/___.jpg. The main reason is that too many subdirectories in one directory is just as bad as too many files.

    The date-oriented approach is taken by a lot of websites - such as Webmonkey and Hotwired - even those that aren't database-oriented.

    agermain
    "In fact, we must do just the opposite and remain ever-vigilant, striving to ensure that we stop the Urkels of tomorrow before they gain power. Only by demanding full accountability can we reverse the shameful legacy of man's inanity to man."
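The per-day variant of Agermain's idea can be sketched with POSIX::strftime (the function name and base directory are hypothetical; gmtime is used so the path doesn't depend on the server's timezone):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# One directory per day, derived purely from the upload time, so no
# per-file bookkeeping is needed.
sub dated_path {
    my ($base, $file, $time) = @_;
    $time = time unless defined $time;               # default to "now"
    my $dir = strftime("%Y/%b/%d", gmtime($time));   # e.g. 2001/Jul/31
    return "$base/$dir/$file";
}

print dated_path("images", "photo.jpg", 0), "\n";    # images/1970/Jan/01/photo.jpg
```

The trade-off the replies above discuss is visible here: the directory a file lands in depends only on arrival time, so a quiet month and a busy month get the same granularity.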

      Thanks for the reply, Agermain. I don't think this solution would be as clean as ant's "a/an/ant", though. I don't have any way of predicting how much traffic this is going to get, so maybe one directory a month is plenty, or maybe one a day is better. I don't want to have to adjust the algorithm after the fact depending on traffic.

      I think your idea works better than my original one, though, since that would eventually end up with 26^3 = 17,576 subdirectories! (doh)
      --
      man with no legs, inc.