in reply to Re^2: Help with Space Efficency algorithim
in thread Help with Space Efficency algorithim

I would agree with your assessment. If the data is known ahead of time, then you can use the standard bin-packing solutions; a sketch of one follows.
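
For the offline case, here is a rough sketch of first-fit decreasing, one standard bin-packing heuristic; the bin size and @file_sizes are made up for illustration:

    # First-fit decreasing: place items largest-first into the first bin
    # with room, opening a new bin only when none fits. $BIN and
    # @file_sizes are hypothetical; //= needs Perl 5.10+.
    my $BIN        = 100;
    my @file_sizes = (70, 20, 55, 10, 45);
    my @bins;
    for my $size (sort { $b <=> $a } @file_sizes) {
        my ($bin) = grep { $_->{free} >= $size } @bins;
        $bin //= do { push @bins, { free => $BIN }; $bins[-1] };
        $bin->{free} -= $size;
    }
    printf "%d bins used\n", scalar @bins;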

If you are getting the new data over time, a best-fit approach may be appropriate: add the new data to the directory with the smallest amount of free space that is still larger than the new data. I think that this would give the best packing over time.

    # Best fit: pick the smallest free space that is still big enough.
    my @spaces = map { { dir => $_, size => dir_size($_) } } @dirs;
    my ($small_space) = sort by_dir_size
                        grep { larger_than_needed($_) } @spaces;
    store_in_space($newdata, $small_space);

The code is pseudocode, but it's almost Perl.

--MidLifeXis

Re^4: Help with Space Efficency algorithim
by dragonchild (Archbishop) on Dec 27, 2007 at 21:06 UTC
    In a way, your approach works nicely. It's easy to comprehend, which makes it appealing from a maintenance perspective. However, it is guaranteed to create relatively significant fragmentation if your average filesize > (binsize / 3). (Not proven, but it seems about right.) With a larger average filesize, your best bet for avoiding fragmentation is to pick the largest free space that will fit.
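
    A hedged sketch of that "largest free space that fits" pick, reusing the hypothetical @spaces, larger_than_needed, and store_in_space helpers from the parent node:

        # Worst fit: pick the largest free space that is still big enough.
        my ($big_space) = sort { $b->{size} <=> $a->{size} }
                          grep { larger_than_needed($_) } @spaces;
        store_in_space($newdata, $big_space);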

    Another optimization is to use something like DBM::Deep to store the bins and how much space is left in each, so that you don't have to recalculate it every time. Furthermore, DBM::Deep is transactional, meaning that multiple processes won't clobber each other.
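
    A minimal sketch of that idea, assuming DBM::Deep 1.x (transactions require num_txns > 1 when the file is created); the file name and the free_space() helper are made up:

        use DBM::Deep;

        my $db = DBM::Deep->new(
            file     => 'bins.db',   # hypothetical cache file
            num_txns => 10,          # > 1 enables transactions
        );

        # Update one bin's cached free space atomically.
        $db->begin_work;
        $db->{$dir} = free_space($dir);
        $db->commit;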


    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

      I need to think through the fragmentation aspect of this. In the extreme, if you always choose the largest bucket your data will fit into, you will always place your data in a fresh bucket. However, if you add the restriction that you must use an existing bucket when possible, I am not certain (yet) which works better: selecting the smallest bucket or the largest. It has been a while (a few years) since I looked at this problem; a quick simulation, sketched below, might settle it.
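
      A sketch of that thought exercise (the bin size and file stream are made up), running both picks over the same stream with the "use an existing bucket if possible" restriction:

          # Compare best fit vs. worst fit: a new bucket opens only when
          # no existing bucket fits. //= needs Perl 5.10+.
          my $BIN   = 100;
          my @files = map { 1 + int rand 60 } 1 .. 200;

          for my $try (['best fit',  sub { $a->{free} <=> $b->{free} }],
                       ['worst fit', sub { $b->{free} <=> $a->{free} }]) {
              my ($name, $order) = @$try;
              my @bins;
              for my $f (@files) {
                  my ($bin) = sort $order grep { $_->{free} >= $f } @bins;
                  $bin //= do { push @bins, { free => $BIN }; $bins[-1] };
                  $bin->{free} -= $f;
              }
              printf "%-9s: %d buckets\n", $name, scalar @bins;
          }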

      I always welcome thought exercises :)

      --MidLifeXis

        The point I'm making is that there isn't a correct solution for all input sets. If you can know something about your input set, you can tune your algorithm. If you cannot, then a simple brute force algorithm may be the most maintainable solution. It's a rather generic point. :-)

        My criteria for good software:
        1. Does it work?
        2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?