perrin has asked for the wisdom of the Perl Monks concerning the following question:

Here's an interesting little puzzle. I have a set of files named after book ISBNs. These can be from five to nine characters long. There are thousands of these, and I don't want to put them all in one directory, because that will hurt file system performance. I was planning to split them up by zero-padding them all to nine characters and then slicing the numbers into directory names. For example, 123456789.gif would be stored at /12/34/56/123456789.gif.
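
For illustration, something like this is what I have in mind (the helper name and the two-digit slices are just one possible split):

    #!/usr/bin/perl -w
    use strict;

    # Zero-pad an ISBN to nine digits and slice the first six digits
    # into three two-digit directory levels.
    sub isbn_to_path {
        my ($isbn) = @_;
        my $padded = sprintf "%09d", $isbn;
        my @dirs = unpack "A2 A2 A2", $padded;    # e.g. '12', '34', '56'
        return join("/", @dirs) . "/$padded.gif";
    }

    print isbn_to_path(123456789), "\n";    # 12/34/56/123456789.gif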

Now, if I make too many directories I'll be hurting file system performance rather than helping. So, the challenge is to come up with a short algorithm for determining the optimal way to break up these files so that we have no more than n (probably 100) files or subdirectories in any one directory, and the minimum number of directories in total. This would be simple if the data were uniformly distributed, but in this case it isn't. It's possible for 99% of the data to have the same first five digits.

I was attacking this through brute force by simulating the directories with hashes and counting the number of files that fell into each one, but I feel like there must be some simple mathematical solution for this that I'm not seeing. Also, my approach makes it a pain to experiment with alternatives like /12345/67/123456789.gif, which might be the best solution with a given set of data.

I know this is kind of dumb to obsess over, and I could just stop fooling around and split them up on every two digits, but it turned into kind of an interesting brainteaser. Anyone?


Re (tilly) 1: brainteaser: splitting up a namespace evenly
by tilly (Archbishop) on Oct 24, 2001 at 06:16 UTC
    The perfect solution is not going to be easy to produce, and as you add to your data set, it is going to change.

    Instead just find a simple, "good enough" solution. Splitting every 2 digits sounds like a good one to me. If you want to complicate your code, you could make the decision about whether to split on the next 2 digits depend on how full your directory is. For instance your rule could be, "When the directory goes beyond 100 files, find the starting 2-digit code with the most entries, create that directory, and move those files into it."
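
    A rough, untested sketch of that rule (File::Copy and the helper name are just my choices here):

    use strict;
    use warnings;
    use File::Copy qw(move);

    # Once a directory holds more than $max files, hive off the busiest
    # two-digit prefix into its own subdirectory.
    sub split_if_full {
        my ($dir, $max) = @_;
        opendir my $dh, $dir or die "Can't open $dir: $!";
        my @files = grep { -f "$dir/$_" } readdir $dh;
        closedir $dh;
        return if @files <= $max;

        my %count;
        $count{ substr($_, 0, 2) }++ for @files;
        my ($busiest) = sort { $count{$b} <=> $count{$a} } keys %count;

        mkdir "$dir/$busiest" or die "Can't mkdir $dir/$busiest: $!";
        for my $f (grep { substr($_, 0, 2) eq $busiest } @files) {
            move("$dir/$f", "$dir/$busiest/$f") or die "Can't move $f: $!";
        }
    }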

    An alternate solution (if you are using Linux) is to switch to a filesystem (e.g. ReiserFS) which is designed to efficiently handle directories with lots of small files. That is, instead of coding up a workaround for filesystem issues, use a filesystem without the problem in the first place.

Re: brainteaser: splitting up a namespace evenly
by blakem (Monsignor) on Oct 24, 2001 at 03:48 UTC
    If you really need them spread out evenly, you could take an MD5 hash of the filename and use it as a base for your directory structure... i.e.:
    md/5_hash_value_of_123/123.gif
    an/other_md5_hash_value/456.gif
    ye/t_another_hash_value/789.gif
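
    For example, a minimal sketch using the core Digest::MD5 module (the two-character bucket width is arbitrary):

    use strict;
    use Digest::MD5 qw(md5_hex);

    # Bucket a file by the first two hex digits of the MD5 of its ISBN.
    my $isbn   = '123456789';
    my $digest = md5_hex($isbn);
    print substr($digest, 0, 2) . "/$isbn.gif\n";    # e.g. 25/123456789.gif
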
    Or, You might want to see if the last two digits are "spread out enough" and build a directory structure based on that.... i.e.:
    98/123456789.gif
    09/123456790.gif
    19/123456791.gif

    -Blake

      Sorry, I should have said this in my question:

      I'd prefer to have directory paths that a human can generate easily by looking at the ISBN. That means MD5 is out.

      UPDATE: The idea about whether or not the last two digits are "spread out enough" is the gist of what I'm looking for, only I don't want to have to manually try the last two digits, then the next to last, etc. until I find the split that spreads them out enough while creating the fewest directories.
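
      For illustration, a rough sketch of the kind of comparison I mean (it reads one ISBN per line on STDIN and assumes purely numeric names; the slice positions tried are arbitrary):

      use strict;
      use warnings;

      # Zero-pad each ISBN, then see how each candidate digit slice
      # would spread the files across buckets.
      chomp(my @raw = <STDIN>);
      my @isbns = map { sprintf "%09d", $_ } @raw;

      for my $start (0 .. 7) {
          for my $width (1 .. 2) {
              my %bucket;
              $bucket{ substr($_, $start, $width) }++ for @isbns;
              my ($worst) = sort { $b <=> $a } values %bucket;
              printf "digits %d-%d: %5d dirs, largest holds %d files\n",
                  $start, $start + $width - 1, scalar(keys %bucket), $worst;
          }
      }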

        OK, you'll have to give us a bit more information about the set of numbers you're trying to hash...... Tough to come up with a clever way to spread them out when we don't know much about the set. ;-)

        -Blake

Re: brainteaser: splitting up a namespace evenly
by clintp (Curate) on Oct 24, 2001 at 06:34 UTC
    You can either do this by sampling the population randomly or by using the entire set as a base. Either way, a straightforward approach would be (let's assume the whole set):
    • Determine the optimum number of files in a directory (make up a number) or directories in a directory. Let's say 50, for example.
    • Load up an array with all of your data. Let's say there are 60000 ISBNs.
    • Determine the integer root of 60000 that most closely yields 50: the square root is about 244, the cube root roughly 38. Make your directory depth 3 (the cube root).
    • Sort your list (or your samples)
    • Split your list into 38 sub-list ranges. The first element in each of these 38 sections represents the lower bound for that section, the last the upper bound. This is your first-level directory.
    • Split each of those into 38 sub-list ranges again. The first and last of each of these sublists represent the range of acceptable files for the second-level directory. (These last two steps are nicely recursive; see the sketch after this list.)
    • If you used a sample, you're gonna need a big enough sample to come close to 38^2.
    • Distribute the stuff in between accordingly into the sub-sub directories. You should now have 38 directories, with 38 subdirectories each with about 38 files.
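
    A rough, untested sketch of those range-splitting steps (the function name and the returned structure are made up for illustration):

    use strict;
    use warnings;
    use POSIX qw(ceil);

    # Split a sorted list of ISBNs into $fanout roughly equal ranges,
    # recursing until a range holds no more than $max_files entries.
    # Each node is [ lower bound, upper bound, children or files ].
    sub split_ranges {
        my ($sorted, $fanout, $max_files) = @_;
        return [ $sorted->[0], $sorted->[-1], $sorted ]
            if @$sorted <= $max_files;

        my $chunk = ceil(@$sorted / $fanout);
        my @children;
        for (my $i = 0; $i < @$sorted; $i += $chunk) {
            my $end = $i + $chunk - 1;
            $end = $#$sorted if $end > $#$sorted;
            push @children,
                split_ranges([ @{$sorted}[ $i .. $end ] ], $fanout, $max_files);
        }
        return [ $sorted->[0], $sorted->[-1], \@children ];
    }

    # e.g. 60000 ISBNs, a fanout of 38, and at most 50 files per leaf:
    # my $tree = split_ranges([ sort @isbns ], 38, 50);
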
Re: brainteaser: splitting up a namespace evenly
by dws (Chancellor) on Oct 24, 2001 at 04:28 UTC
    If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.

    If humans are going to be looking at the data, I recommend using the full ISBN for the filename. It'll save you a mess later if/when you need to reorganize.

      If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.

      Well, the point is that I don't really know how they're split up except that I don't think they're very evenly distributed through the whole range. I did try it a couple of ways, but trying every possible combination manually isn't an efficient way to find the sweet spot.

      If humans are going to be looking at the data, I recommend using the full ISBN for the filename.

      Like this? /12/34/56/78/9/123456789.gif

        A quick way to determine the distribution of the last two digits is to create a hash where each key is a two-digit combination and each value is the total count of that combination. For example, the first time the script comes across '45' it uses exists() to check for the key; if the key is there, it increments the value, and if not, it creates the key and starts the value at 1.
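
        A minimal version of that counting loop might look like this (exists() isn't strictly needed, since ++ on a missing key creates it and starts it at 1):

        use strict;
        use warnings;

        my %count;
        while (my $isbn = <STDIN>) {          # one ISBN per line
            chomp $isbn;
            $count{ substr($isbn, -2) }++;    # tally the last two digits
        }
        printf "%s => %d\n", $_, $count{$_} for sort keys %count;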

        The end result might look something like:
        my %hash = (
                '01' => 302,
                '02' => 404,
                '23' => 1002,
                # ... and so on
        );
        This would at least allow you to grasp the distribution of the data. I would definitely use this sort of approach (ordering by the last n digits) because it allows for new data to be added without resizing any of the other directories.
Re: brainteaser: splitting up a namespace evenly
by joealba (Hermit) on Oct 24, 2001 at 06:59 UTC
    What kind of data is stored in the files? If it is structured data, it sounds like this information would be better stored in a database, rather than in all these files. That's what databases were created for -- storing hundreds of thousands of records in an organized, optimized way for easy retrieval later.

    Otherwise, you can easily have 1000 files in a directory without hurting performance. Grab the first 4 or 5 digits for your directory naming convention. That will make it easy to find the files when you need them. Don't make yourself traverse more than 1 directory if it's really not needed.
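
    For instance (assuming nine-digit zero-padded names and a .gif extension):

    use strict;

    # One directory level, named after the first five digits of the ISBN.
    my $isbn = sprintf "%09d", 123456789;
    my $path = substr($isbn, 0, 5) . "/$isbn.gif";    # 12345/123456789.gif
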
      They're image files, and my experiences with stuffing images into databases have all been negative. Thanks for the suggestion though.

      It may be that more than 1000 files would not hurt performance much (this is Solaris), but it's just much easier to work with directories that have a reasonable number of files in them, i.e. ones that don't break tools like ls.

Re: brainteaser: splitting up a namespace evenly
by Erik Hensema (Sexton) on Oct 24, 2001 at 13:25 UTC

    Warning: non-perl solution ahead! ;-)

    If you're on Linux you may want to go for a ReiserFS partition. ReiserFS uses balanced trees (B+ trees) to index directories and is still VERY fast on directories holding hundreds of thousands of files.

    It may take a while for a simple 'ls' to complete though, because ls sorts alphabetically.

Re: brainteaser: splitting up a namespace evenly
by pike (Monk) on Oct 24, 2001 at 14:45 UTC
    OK, here is my proposal: my script makes a hierarchy of directories, the depth depending on how many files fall in the given namespace. It ensures that there are no more than a given number of files in any dir. It doesn't guarantee, however, that some of the dirs won't end up with far fewer files than that. But it's fairly simple, so I suggest you give it a go and let me know how it works for you...

    #!/usr/bin/perl -w
    use strict;

    my $maxFiles = 100;    # max no of files per directory
    my $dir = shift;
    chdir $dir or die "Can't change to directory $dir: $!\n";

    foreach my $i (0..9) {
        partitionRecursive($i, '.');
    }

    sub partitionRecursive {
        my ($prefix, $path) = @_;
        my @files = glob("$prefix*");
        return unless @files;

        mkdir "$path/_$prefix" or die "Can't make dir $path/_$prefix: $!\n";

        if (@files < $maxFiles) {
            # if there are fewer than maxFiles, move them all to the new dir
            system("mv $prefix* $path/_$prefix") == 0
                or die "Can't move files to $prefix: $!\n";
        }
        else {
            # otherwise, move those with one more character than the prefix to the new dir
            # (can be omitted if all filenames are equal in length)
            system("mv $prefix? $path/_$prefix") == 0
                or die "Can't move files to $prefix: $!\n";
        }

        foreach my $nextDigit (0..9) {
            partitionRecursive("$prefix$nextDigit", "$path/_$prefix");
        }
    }

    pike