Re (tilly) 1: brainteaser: splitting up a namespace evenly
by tilly (Archbishop) on Oct 24, 2001 at 06:16 UTC
The perfect solution is not going to be easy to produce, and it will change as you add to your data set. Instead, just find a simple, "good enough" solution. Splitting every 2 digits sounds like a good one to me. If you want to complicate your code, you could make the decision about whether to split at the next 2 digits depend on how full the directory is. For instance, your rule could be, "When a directory goes beyond 100 files, find the leading 2-digit code with the most entries, create that directory, and move those files into it."
An alternate solution (if you are using Linux) is to switch to a filesystem (e.g. ReiserFS) that is designed to handle directories with lots of small files efficiently. That is, instead of coding up a workaround for filesystem issues, use a filesystem without the problem in the first place.
Re: brainteaser: splitting up a namespace evenly
by blakem (Monsignor) on Oct 24, 2001 at 03:48 UTC
If you really need them spread out evenly, you could take an MD5 hash of the filename and use it as a base for your directory structure... i.e.:
md/5_hash_value_of_123/123.gif
an/other_md5_hash_value/456.gif
ye/t_another_hash_value/789.gif
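Something along those lines, as a minimal sketch using the Digest::MD5 module (taking the first two hex characters of the digest as the bucket is just one possible split):

use strict;
use Digest::MD5 qw(md5_hex);

my $isbn   = '123456789';
my $digest = md5_hex($isbn);                        # 32 hex characters
my $path   = substr($digest, 0, 2) . "/$isbn.gif";  # bucket by the first two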
Or, you might want to see if the last two digits are "spread out enough" and build a directory structure based on that... i.e.:
98/123456789.gif
09/123456790.gif
19/123456791.gif
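For the second approach, a one-liner sketch (the example paths above appear to use the last two digits in reverse order, e.g. ...789 lands in 98/):

my $isbn   = '123456789';
my $bucket = scalar reverse substr($isbn, -2);   # "98"
my $path   = "$bucket/$isbn.gif";                # 98/123456789.gif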
-Blake
Sorry, I should have said this in my question:
I'd prefer to have directory paths that a human can generate easily by looking at the ISBN. That means MD5 is out.
UPDATE: The idea about whether the last two digits are "spread out enough" is the gist of what I'm looking for, only I don't want to manually try the last two digits, then the next-to-last, etc., until I find the split that spreads them out enough with the fewest directories.
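Something like the following sketch could automate that search; it tries increasingly long suffixes rather than arbitrary digit positions, and the isbns.txt filename and the 100-file limit are placeholders:

use strict;

my $limit = 100;
chomp(my @isbns = do { open my $fh, '<', 'isbns.txt' or die $!; <$fh> });

for my $len (1 .. 5) {
    # count how many files would land in each suffix bucket
    my %count;
    $count{ substr($_, -$len) }++ for @isbns;
    my ($biggest) = sort { $b <=> $a } values %count;
    printf "suffix length %d: %d dirs, largest holds %d files\n",
        $len, scalar keys %count, $biggest;
    # stop at the shortest suffix that keeps every bucket under the limit
    if ($biggest <= $limit) {
        print "use the last $len digit(s)\n";
        last;
    }
}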
OK, you'll have to give us a bit more information about the set of numbers you're trying to hash... It's tough to come up with a clever way to spread them out when we don't know much about the set. ;-)
-Blake
Re: brainteaser: splitting up a namespace evenly
by clintp (Curate) on Oct 24, 2001 at 06:34 UTC
You can either do this by sampling the population randomly or by using the entire set as a base. Either way, a straightforward approach would be (let's assume the whole set; a rough sketch follows the steps):
- Determine the optimum number of files in a directory (make up a number) or directories in a directory. Let's say 50, for an example.
- Load up an array with all of your data. Let's say there are 60000 ISBNs.
- Determine which integer root of 60000 comes closest to 50: the square root is about 245, the cube root roughly 39. Make your directory depth 3 (cube root).
- Sort your list (or your samples).
- Split your list into 39 sub-list ranges. The first element in each of these 39 sections is the lower bound for that section, the last the upper bound. These are your first-level directories.
- Split each of those into 39 sub-list ranges again. The first and last of each of these sublists give the range of acceptable files for the second-level directory. (These last two steps are nicely recursive...)
- If you used a sample, you're going to need a sample big enough to come close to 39^2 entries.
- Distribute the files accordingly into the sub-sub directories. You should now have 39 directories, with 39 subdirectories each holding about 39 files.
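A rough sketch of that range-partitioning, with two directory levels per the 39 x 39 example; naming each directory after the range it covers and standing in random numbers for the real ISBN list are assumptions for illustration:

use strict;

sub partition {
    my ($sorted, $fanout, $levels, $dir) = @_;
    if ($levels == 0) {
        # leaf: these ISBNs would live directly in $dir
        print "$dir would hold ", scalar @$sorted, " files\n";
        return;
    }
    my $chunk = int(@$sorted / $fanout) + 1;
    while (my @slice = splice @$sorted, 0, $chunk) {
        # name each subdirectory after the range of ISBNs it covers
        partition(\@slice, $fanout, $levels - 1, "$dir/$slice[0]-$slice[-1]");
    }
}

# 60,000 random 9-digit numbers stand in for the real ISBN list
my @isbns = sort { $a <=> $b } map { 100_000_000 + int rand 900_000_000 } 1 .. 60_000;
partition(\@isbns, 39, 2, '.');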
Re: brainteaser: splitting up a namespace evenly
by dws (Chancellor) on Oct 24, 2001 at 04:28 UTC
If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.
If humans are going to be looking at the data, I recommend using the full ISBN for the filename. It'll save you a mess later if/when you need to reorganize.
If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.
Well, the point is that I don't really know how they're split up except that I don't think they're very evenly distributed through the whole range. I did try it a couple of ways, but trying every possible combination manually isn't an efficient way to find the sweet spot.
If humans are going to be looking at the data, I recommend using the full ISBN for the filename.
Like this? /12/34/56/78/9/123456789.gif
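If that's the shape you're after, a quick sketch of building such a path by splitting the ISBN into 2-digit chunks:

my $isbn   = '123456789';
my @chunks = $isbn =~ /(\d{1,2})/g;               # 12 34 56 78 9
my $path   = join('/', @chunks) . "/$isbn.gif";   # 12/34/56/78/9/123456789.gif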
A quick way to determine the distribution of the last two digits is to build a hash where each key is a two-digit combination and each value is the total count of that combination. E.g., the first time the script comes across '45' it uses exists() to check for the key; if the key is there, it increments the value; if not, it creates the key and starts the value at 1.
The end result might look something like:
my %hash = (
    '01' => 302,
    '02' => 404,
    '23' => 1002,
    # ... and so on
);
This would at least allow you to grasp the distribution of the data. I would definitely use this sort of approach (ordering by the last n digits) because it allows new data to be added without resizing any of the other directories.
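A short sketch of that counting pass, reading ISBNs one per line from standard input (note that in Perl the ++ creates a missing key on its own, so the exists() check isn't strictly needed):

use strict;

my %count;
while (my $isbn = <STDIN>) {
    chomp $isbn;
    $count{ substr($isbn, -2) }++;      # tally by the last two digits
}
for my $suffix (sort keys %count) {
    print "$suffix => $count{$suffix}\n";
}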
Re: brainteaser: splitting up a namespace evenly
by joealba (Hermit) on Oct 24, 2001 at 06:59 UTC
What kind of data is stored in the files? If it is structured data, it sounds like this information would be better stored in a database rather than in all these files. That's what databases were created for: storing hundreds of thousands of records in an organized, optimized way for easy retrieval later.
Otherwise, you can easily have 1000 files in a directory without hurting performance. Grab the first 4 or 5 digits for your directory naming convention. That will make it easy to find the files when you need them. Don't make yourself traverse more than one directory if it's really not needed.
They're image files, and my experiences with stuffing images into databases have all been negative. Thanks for the suggestion, though.
It may be that more than 1000 files per directory wouldn't hurt performance much (this is Solaris), but it's simply much easier to work with directories that hold a reasonable number of files, i.e. ones that don't break tools like ls.
Re: brainteaser: splitting up a namespace evenly
by Erik Hensema (Sexton) on Oct 24, 2001 at 13:25 UTC
Warning: non-perl solution ahead! ;-)
If you're on Linux you may want to go for a ReiserFS partition. ReiserFS uses balanced trees to index directories and is still VERY fast on directories holding hundreds of thousands of files.
It may take a while for a simple 'ls' to complete though, because ls sorts alphabetically.
Re: brainteaser: splitting up a namespace evenly
by pike (Monk) on Oct 24, 2001 at 14:45 UTC
OK, here is my proposal: my script builds a hierarchy of directories, with the depth depending on how many files fall in the given namespace. It ensures that no directory holds more than a given number of files. It doesn't guarantee, however, that some directories won't end up with far fewer. But it's fairly simple, so I suggest you give it a go and let me know how it works for you...
#!/usr/bin/perl -w
use strict;

my $maxFiles = 100;    # max no of files per directory

my $dir = shift;
chdir $dir or die "Can't change to directory $dir: $!\n";

foreach my $i (0..9) {
    partitionRecursive($i, '.');
}

sub partitionRecursive {
    my ($prefix, $path) = @_;
    my @files = glob("$prefix*");
    return unless @files;
    mkdir "$path/_$prefix" or die "Can't make dir $path/_$prefix: $!\n";
    if (@files < $maxFiles) {
        # if there are fewer than maxFiles, move them all to the new dir
        system("mv $prefix* $path/_$prefix") == 0
            or die "Can't move files to $prefix: $!\n";
    }
    else {
        # otherwise, move only those with one more character than the prefix
        # (can be omitted if all filenames are equal in length)
        system("mv $prefix? $path/_$prefix") == 0
            or die "Can't move files to $prefix: $!\n";
    }
    foreach my $nextDigit (0..9) {
        partitionRecursive("$prefix$nextDigit", "$path/_$prefix");
    }
}

pike