in reply to brainteaser: splitting up a namespace evenly

If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.
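As a minimal sketch of the last-two-digits bucketing (the base directory and file extension here are hypothetical, just for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Bucket a file by the last two digits of its ISBN.
my $isbn = '123456789';
my ($bucket) = $isbn =~ /(\d\d)\z/;        # last two digits: '89'
my $path = "/data/$bucket/$isbn.gif";      # /data/89/123456789.gif
```

With at most 100 possible buckets ('00' through '99'), the top level stays small; whether the files spread evenly across them is what the sample-data test would tell you.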

If humans are going to be looking at the data, I recommend using the full ISBN for the filename. It'll save you a mess later if/when you need to reorganize.

  • Comment on Re: brainteaser: splitting up a namespace evenly

Replies are listed 'Best First'.
Re: Re: brainteaser: splitting up a namespace evenly
by perrin (Chancellor) on Oct 24, 2001 at 05:50 UTC
    If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.

    Well, the point is that I don't really know how they're split up except that I don't think they're very evenly distributed through the whole range. I did try it a couple of ways, but trying every possible combination manually isn't an efficient way to find the sweet spot.

    If humans are going to be looking at the data, I recommend using the full ISBN for the filename.

    Like this? /12/34/56/78/9/123456789.gif
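A sketch of how that kind of path could be built, splitting the ISBN into two-digit chunks (the .gif extension is taken from the example above; the exact depth is an assumption):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split an ISBN into two-digit path components.
my $isbn  = '123456789';
my @parts = unpack '(A2)*', $isbn;         # ('12','34','56','78','9')
my $path  = '/' . join('/', @parts) . "/$isbn.gif";
# $path is now /12/34/56/78/9/123456789.gif
```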

      A quick way to determine the distribution of the last two digits is to create a hash where each key is a two-digit combination and each value is the total count of that combination. I.e., the first time the script comes across '45' it uses exists() to check for the key; if the key is there, it increments (++) the value; if not, it creates the key and starts the value at 1.

      The end result might look something like:

      %hash = (
              '01' => 302,
              '02' => 404,
              '23' => 1002,
              # ... and so on
      );

      This would at least allow you to grasp the distribution of the data. I would definitely use this sort of approach (ordering by the last n digits), because it allows new data to be added without resizing any of the other directories.