in reply to brainteaser: splitting up a namespace evenly

If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.
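As a minimal sketch of the last-two-digits bucketing (the base directory and file extension here are hypothetical, just for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Bucket a file by the last two digits of its ISBN.
my $isbn = '123456789';
my ($bucket) = $isbn =~ /(\d\d)\z/;        # last two digits: '89'
my $path = "/data/$bucket/$isbn.gif";      # /data/89/123456789.gif
```

With at most 100 possible buckets ('00' through '99'), the top level stays small; whether the files spread evenly across them is what the sample-data test would tell you.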

If humans are going to be looking at the data, I recommend using the full ISBN for the filename. It'll save you a mess later if/when you need to reorganize.

  • Comment on Re: brainteaser: splitting up a namespace evenly

Replies are listed 'Best First'.
Re: Re: brainteaser: splitting up a namespace evenly
by perrin (Chancellor) on Oct 24, 2001 at 05:50 UTC
    If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.

    Well, the point is that I don't really know how they're split up except that I don't think they're very evenly distributed through the whole range. I did try it a couple of ways, but trying every possible combination manually isn't an efficient way to find the sweet spot.

    If humans are going to be looking at the data, I recommend using the full ISBN for the filename.

    Like this? /12/34/56/78/9/123456789.gif
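A sketch of how that kind of path could be built, splitting the ISBN into two-digit chunks (the .gif extension is taken from the example above; the exact depth is an assumption):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split an ISBN into two-digit path components.
my $isbn  = '123456789';
my @parts = unpack '(A2)*', $isbn;         # ('12','34','56','78','9')
my $path  = '/' . join('/', @parts) . "/$isbn.gif";
# $path is now /12/34/56/78/9/123456789.gif
```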

      A quick way to determine the distribution of the last two digits is to create a hash where each key is a two-digit combination and each value is the total count of that combination. I.e., the first time the script comes across '45' it uses exists() to check for the key; if the key is there, it increments (++) the value; if not, it creates the key and starts the value at 1.

      The end result might look something like:

      %hash = (
              '01' => 302,
              '02' => 404,
              '23' => 1002,
              # ... and so on
      );

      This would at least allow you to grasp the distribution of the data. I would definitely use this sort of approach (ordering by the last n digits), because it allows new data to be added without resizing any of the other directories.