You can do this either by randomly sampling the population or by using the entire set as a base. Either way, a straightforward approach (let's assume the whole set) would be to:
- Pick a target number of files (or subdirectories) per directory. Any reasonable number will do; let's say 50 for this example.
- Load up an array with all of your data. Let's say there are 60000 ISBNs.
- Determine the integer root of 60000 that comes closest to 50. The square root is about 245 and the cube root is roughly 38 (closer to 39, but 38 is the figure used from here on). Make your directory depth 3 (cube root).
- Sort your list (or your samples)
- Split your list into 38 sub-list ranges. The first and last elements of each of these 38 sections are the bounds of that section. This is your first-level directory.
- Split each of those into 38 sub-list ranges again. The first and last of each of these sublists represent the range of acceptable files for the second level directory. (These last two steps are nicely recursive...)
- If you used a sample, you're going to need a big enough sample to come close to 38^2 (1444) elements, so that every second-level range gets its own bounds.
- Distribute everything in between into the sub-sub-directories accordingly. You should now have 38 directories, each with 38 subdirectories holding roughly 40 files apiece (60000 / 38^2 is about 41.5).
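
The steps above can be sketched in Python roughly as follows. This is only an illustration under some assumptions: the function names (`choose_depth`, `split_ranges`, `build_tree`) are made up for this example, the "ISBNs" are just zero-padded sequential strings, and directories are represented as nested dicts labelled "first-last" rather than created on disk. Computed exactly, the fan-out for 60000 items comes out at 39 rather than the rounded 38 used above.

```python
def choose_depth(n_items, target, max_depth=10):
    """Return (depth, fanout): the depth whose depth-th root of n_items
    is closest to the target number of entries per directory."""
    depth = min(range(1, max_depth + 1),
                key=lambda d: abs(n_items ** (1.0 / d) - target))
    return depth, round(n_items ** (1.0 / depth))


def split_ranges(items, parts):
    """Split a sorted list into up to `parts` contiguous, near-equal slices."""
    n = len(items)
    chunks = [items[i * n // parts:(i + 1) * n // parts] for i in range(parts)]
    return [c for c in chunks if c]  # drop empty slices when n < parts


def build_tree(sorted_items, fanout, depth):
    """Recursively split into ranges; the last level is a plain list of files."""
    if depth <= 1:
        return list(sorted_items)
    return {
        # Hypothetical naming scheme: label each directory by its bounds.
        f"{chunk[0]}-{chunk[-1]}": build_tree(chunk, fanout, depth - 1)
        for chunk in split_ranges(sorted_items, fanout)
    }


# Stand-in data: 60000 zero-padded strings in place of real ISBNs.
isbns = sorted(f"{i:010d}" for i in range(60000))
depth, fanout = choose_depth(len(isbns), target=50)
tree = build_tree(isbns, fanout, depth)
```

Each key of `tree` is a first-level directory, each key below it a second-level directory, and each innermost list holds the files for that range. If you work from a sample instead, you would build the range labels from the sorted sample and then place every real item into the range that contains it.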