
If you want to limit yourself to 0 < n < 100 files per subdirectory, and the leading 5 digits are the same 99% of the time, then using the last 2 digits to form the bucket (subdirectory) may be the best way to go. Try it on your sample data to get a feel for whether it would create too many top-level directories.

Well, the point is that I don't really know how they're split up except that I don't think they're very evenly distributed through the whole range. I did try it a couple of ways, but trying every possible combination manually isn't an efficient way to find the sweet spot.

If humans are going to be looking at the data, I recommend using the full ISBN for the filename.

Like this? /12/34/56/78/9/123456789.gif


Re: Re: Re: brainteaser: splitting up a namespace evenly
by driffero (Initiate) on Oct 25, 2001 at 03:42 UTC
    A quick way to determine the distribution of the last two digits is to build a hash where each key is a two-digit combination and each value is the total count of that combination. For example: when the script comes across '45', it uses exists() to check for the key. If the key is there, it increments the value with ++; if not, it creates the key and starts the value at 1.

    The end result might look something like:
    %hash = (
            '01' => 302,
            '02' => 404,
            '23' => 1002,
            # ... and so on
    );

    This would at least allow you to grasp the distribution of the data. I would definitely use this sort of approach (ordering by the last n digits) because it allows new data to be added without resizing any of the other directories.
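
For what it's worth, the counting described above can be sketched roughly like this. It is only a minimal sketch: the sub name bucket_counts and the sample ISBNs are made up for illustration, and it assumes the ISBNs are already available as a plain list of strings. In Perl the exists() check can be skipped, because ++ on a missing key creates it and sets it to 1.

```perl
use strict;
use warnings;

# Count how many ISBNs fall into each bucket, where a bucket is the
# last $n characters of the ISBN. (bucket_counts is a hypothetical
# helper name, not something from the original post.)
sub bucket_counts {
    my ($n, @isbns) = @_;
    my %count;
    for my $isbn (@isbns) {
        next unless length($isbn) >= $n;   # skip anything too short to bucket
        my $bucket = substr($isbn, -$n);   # last $n characters
        $count{$bucket}++;                 # a missing key starts at 1 automatically
    }
    return %count;
}

# Made-up sample data; replace with the real list of ISBNs.
my @sample = qw(123456789 123456745 123450045 123459945 987654389);
my %hash = bucket_counts(2, @sample);

# Print buckets from fullest to emptiest, ties broken by bucket name.
for my $bucket (sort { $hash{$b} <=> $hash{$a} || $a cmp $b } keys %hash) {
    printf "%s => %d\n", $bucket, $hash{$bucket};
}
```

Changing the first argument from 2 to 3 or 4 lets you compare bucket sizes for different values of n without trying each directory layout by hand.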