perrin has asked for the wisdom of the Perl Monks concerning the following question:

Here's an interesting little puzzle. I have a set of files named after book ISBNs. These can be from five to nine characters long. There are thousands of these, and I don't want to put them all in one directory, because that will hurt file system performance. I was planning to split them up by zero-padding them all to nine characters and then slicing the numbers into directory names. For example, 123456789.gif would be stored at /12/34/56/123456789.gif.
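
For illustration, something like this is what I have in mind (the helper name and the two-digit slices are just one possible split):

    #!/usr/bin/perl -w
    use strict;

    # Zero-pad an ISBN to nine digits and slice the first six digits
    # into three two-digit directory levels.
    sub isbn_to_path {
        my ($isbn) = @_;
        my $padded = sprintf "%09d", $isbn;
        my @dirs = unpack "A2 A2 A2", $padded;    # e.g. '12', '34', '56'
        return join("/", @dirs) . "/$padded.gif";
    }

    print isbn_to_path(123456789), "\n";    # 12/34/56/123456789.gif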

Now, if I make too many directories I'll be hurting file system performance rather than helping. So, the challenge is to come up with a short algorithm for determining the optimal way to break up these files so that we have no more than n (probably 100) files or subdirectories in any one directory, and the minimum number of directories in total. This would be simple if the data were uniformly distributed, but in this case it isn't. It's possible for 99% of the data to have the same first five digits.

I was attacking this through brute force by simulating the directories with hashes and counting the number of files that fell into each one, but I feel like there must be some simple mathematical solution for this that I'm not seeing. Also, my approach makes it a pain to experiment with alternatives like /12345/67/123456789.gif, which might be the best solution with a given set of data.

I know this is kind of dumb to obsess over, and I could just stop fooling around and split them up on every two digits, but it turned into kind of an interesting brainteaser. Anyone?


Re (tilly) 1: brainteaser: splitting up a namespace evenly
by tilly (Archbishop) on Oct 24, 2001 at 06:16 UTC
    The perfect solution is not going to be easy to produce, and as you add to your data set, it is going to change.

    Instead just find a simple, "good enough" solution. Splitting every 2 digits sounds like a good one to me. If you want to complicate your code, you could make the decision about whether to split on the next 2 digits depend on how full your directory is. For instance your rule could be, "When the directory goes beyond 100 files, find the starting 2-digit code with the most entries, create that directory, and move those files into it."
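
    A rough, untested sketch of that rule (File::Copy and the helper name are just my choices here):

    use strict;
    use warnings;
    use File::Copy qw(move);

    # Once a directory holds more than $max files, hive off the busiest
    # two-digit prefix into its own subdirectory.
    sub split_if_full {
        my ($dir, $max) = @_;
        opendir my $dh, $dir or die "Can't open $dir: $!";
        my @files = grep { -f "$dir/$_" } readdir $dh;
        closedir $dh;
        return if @files <= $max;

        my %count;
        $count{ substr($_, 0, 2) }++ for @files;
        my ($busiest) = sort { $count{$b} <=> $count{$a} } keys %count;

        mkdir "$dir/$busiest" or die "Can't mkdir $dir/$busiest: $!";
        for my $f (grep { substr($_, 0, 2) eq $busiest } @files) {
            move("$dir/$f", "$dir/$busiest/$f") or die "Can't move $f: $!";
        }
    }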

    An alternate solution (if you are using Linux) is to switch to a filesystem (e.g. ReiserFS) which is designed to efficiently handle directories with lots of small files. That is, instead of coding up a workaround for filesystem issues, use a filesystem without the problem in the first place.

Re: brainteaser: splitting up a namespace evenly
by blakem (Monsignor) on Oct 24, 2001 at 03:48 UTC
    If you really need them spread out evenly, you could take an MD5 hash of the filename and use it as a base for your directory structure... i.e.:
    md/5_hash_value_of_123/123.gif
    an/other_md5_hash_value/456.gif
    ye/t_another_hash_value/789.gif
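
    For example, a minimal sketch using the core Digest::MD5 module (the two-character bucket width is arbitrary):

    use strict;
    use Digest::MD5 qw(md5_hex);

    # Bucket a file by the first two hex digits of the MD5 of its ISBN.
    my $isbn   = '123456789';
    my $digest = md5_hex($isbn);
    print substr($digest, 0, 2) . "/$isbn.gif\n";    # e.g. 25/123456789.gif
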
    Or, You might want to see if the last two digits are "spread out enough" and build a directory structure based on that.... i.e.:
    98/123456789.gif
    09/123456790.gif
    19/123456791.gif

    -Blake

      Sorry, I should have said this in my question:

      I'd prefer to have directory paths that a human can generate easily by looking at the ISBN. That means MD5 is out.

      UPDATE: The idea about whether or not the last two digits are "spread out enough" is the gist of what I'm looking for, only I don't want to have to manually try the last two digits, then the next to last, etc. until I find the split that spreads them out enough while creating the fewest directories.
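
      For illustration, a rough sketch of the kind of comparison I mean (it reads one ISBN per line on STDIN and assumes purely numeric names; the slice positions tried are arbitrary):

      use strict;
      use warnings;

      # Zero-pad each ISBN, then see how each candidate digit slice
      # would spread the files across buckets.
      chomp(my @raw = <STDIN>);
      my @isbns = map { sprintf "%09d", $_ } @raw;

      for my $start (0 .. 7) {
          for my $width (1 .. 2) {
              my %bucket;
              $bucket{ substr($_, $start, $width) }++ for @isbns;
              my ($worst) = sort { $b <=> $a } values %bucket;
              printf "digits %d-%d: %5d dirs, largest holds %d files\n",
                  $start, $start + $width - 1, scalar(keys %bucket), $worst;
          }
      }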

        OK, you'll have to give us a bit more information about the set of numbers you're trying to hash...... Tough to come up with a clever way to spread them out when we don't know much about the set. ;-)

        -Blake

Re: brainteaser: splitting up a namespace evenly
by clintp (Curate) on Oct 24, 2001 at 06:34 UTC
    You can either do this by sampling the population randomly or by using the entire set as a base. Either way, a straightforward approach would be (let's assume the whole set):
    • Determine the optimum number of files in a directory (make up a number) or directories in a directory. Let's say 50, for example.
    • Load up an array with all of your data. Let's say there are 60000 ISBNs.
    • Determine the integer root of 60000 that most closely yields 50: the square root is about 244, the cube root roughly 38. Make your directory depth 3 (the cube root).
    • Sort your list (or your samples)
    • Split your list into 38 sub-list ranges. The first element in each of these 38 sections represents the lower bound for that section, the last the upper bound. This is your first-level directory.
    • Split each of those into 38 sub-list ranges again. The first and last of each of these sublists represent the range of acceptable files for the second-level directory. (These last two steps are nicely recursive; see the sketch after this list.)
    • If you used a sample, you're gonna need a big enough sample to come close to 38^2.
    • Distribute the stuff in between accordingly into the sub-sub directories. You should now have 38 directories, with 38 subdirectories each with about 38 files.
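
    A rough, untested sketch of those range-splitting steps (the function name and the returned structure are made up for illustration):

    use strict;
    use warnings;
    use POSIX qw(ceil);

    # Split a sorted list of ISBNs into $fanout roughly equal ranges,
    # recursing until a range holds no more than $max_files entries.
    # Each node is [ lower bound, upper bound, children or files ].
    sub split_ranges {
        my ($sorted, $fanout, $max_files) = @_;
        return [ $sorted->[0], $sorted->[-1], $sorted ]
            if @$sorted <= $max_files;

        my $chunk = ceil(@$sorted / $fanout);
        my @children;
        for (my $i = 0; $i < @$sorted; $i += $chunk) {
            my $end = $i + $chunk - 1;
            $end = $#$sorted if $end > $#$sorted;
            push @children,
                split_ranges([ @{$sorted}[ $i .. $end ] ], $fanout, $max_files);
        }
        return [ $sorted->[0], $sorted->[-1], \@children ];
    }

    # e.g. 60000 ISBNs, a fanout of 38, and at most 50 files per leaf:
    # my $tree = split_ranges([ sort @isbns ], 38, 50);
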
Re: brainteaser: splitting up a namespace evenly
by dws (Chancellor) on Oct 24, 2001 at 04:28 UTC
    If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.

    If humans are going to be looking at the data, I recommend using the full ISBN for the filename. It'll save you a mess later if/when you need to reorganize.

      If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.

      Well, the point is that I don't really know how they're split up except that I don't think they're very evenly distributed through the whole range. I did try it a couple of ways, but trying every possible combination manually isn't an efficient way to find the sweet spot.

      If humans are going to be looking at the data, I recommend using the full ISBN for the filename.

      Like this? /12/34/56/78/9/123456789.gif

        A quick way to determine the distribution of the last two digits is to create a hash where each key is a two-digit combination and each value is the total count of that combination. For example, the first time the script comes across '45' it uses exists() to check for the key; if the key is there, it increments the value, and if not, it creates the key and starts the value at 1.
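
        A minimal version of that counting loop might look like this (exists() isn't strictly needed, since ++ on a missing key creates it and starts it at 1):

        use strict;
        use warnings;

        my %count;
        while (my $isbn = <STDIN>) {          # one ISBN per line
            chomp $isbn;
            $count{ substr($isbn, -2) }++;    # tally the last two digits
        }
        printf "%s => %d\n", $_, $count{$_} for sort keys %count;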

        The end result might look something like:
        my %hash = (
                '01' => 302,
                '02' => 404,
                '23' => 1002,
                # ... and so on
        );
        This would at least allow you to grasp the distribution of the data. I would definitely use this sort of approach (ordering by the last n digits) because it allows for new data to be added without resizing any of the other directories.
Re: brainteaser: splitting up a namespace evenly
by joealba (Hermit) on Oct 24, 2001 at 06:59 UTC
    What kind of data is stored in the files? If it is structured data, it sounds like this information would be better stored in a database, rather than in all these files. That's what databases were created for -- storing hundreds of thousands of records in an organized, optimized way for easy retrieval later.

    Otherwise, you can easily have 1000 files in a directory without hurting performance. Grab the first 4 or 5 digits for your directory naming convention. That will make it easy to find the files when you need them. Don't make yourself traverse more than 1 directory if it's really not needed.
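
    For instance (assuming nine-digit zero-padded names and a .gif extension):

    use strict;

    # One directory level, named after the first five digits of the ISBN.
    my $isbn = sprintf "%09d", 123456789;
    my $path = substr($isbn, 0, 5) . "/$isbn.gif";    # 12345/123456789.gif
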
      They're image files, and my experiences with stuffing images into databases have all been negative. Thanks for the suggestion though.

      It may be that more than 1000 files would not hurt performance much (this is Solaris), but it's just much easier to work with directories that have a reasonable number of files in them, i.e. ones that don't break tools like ls.

Re: brainteaser: splitting up a namespace evenly
by Erik Hensema (Sexton) on Oct 24, 2001 at 13:25 UTC

    Warning: non-perl solution ahead! ;-)

    If you're on Linux you may want to go for a ReiserFS partition. ReiserFS uses balanced trees (B+ trees) to index directories and is still VERY fast on directories holding hundreds of thousands of files.

    It may take a while for a simple 'ls' to complete though, because ls sorts alphabetically.

Re: brainteaser: splitting up a namespace evenly
by pike (Monk) on Oct 24, 2001 at 14:45 UTC
    OK, here is my proposal: my script makes a hierarchy of directories, the depth depending on how many files fall in the given namespace. It ensures that there are no more than a given number of files in any dir. It doesn't guarantee, however, that some of the dirs won't end up with far fewer files than that. But it's fairly simple, so I suggest you give it a go and let me know how it works for you...

    #!/usr/bin/perl -w
    use strict;

    my $maxFiles = 100;    # max no of files per directory
    my $dir = shift;
    chdir $dir or die "Can't change to directory $dir: $!\n";

    foreach my $i (0..9) {
        partitionRecursive($i, '.');
    }

    sub partitionRecursive {
        my ($prefix, $path) = @_;
        my @files = glob("$prefix*");
        return unless @files;

        mkdir "$path/_$prefix" or die "Can't make dir $path/_$prefix: $!\n";

        if (@files < $maxFiles) {
            # if there are fewer than maxFiles, move them all to the new dir
            system("mv $prefix* $path/_$prefix") == 0
                or die "Can't move files to $prefix: $!\n";
        }
        else {
            # otherwise, move those with one more character than the prefix to the new dir
            # (can be omitted if all filenames are equal in length)
            system("mv $prefix? $path/_$prefix") == 0
                or die "Can't move files to $prefix: $!\n";
        }

        foreach my $nextDigit (0..9) {
            partitionRecursive("$prefix$nextDigit", "$path/_$prefix");
        }
    }

    pike