in reply to Re: Help with Space Efficency algorithim
in thread Help with Space Efficency algorithim

MidLifeXis,
Your advice may not help jkhunt. It sounds as though the packages may be placed over time and are not all available up front for packing. That makes the question:

How can I tell how much free space is left in each of a list of directories?

use strict;
use warnings;
use Cwd;

use constant LIMIT => 4 * 1024 * 1024 * 1000;

my @dirs; # perhaps found with File::Find

for my $dir (@dirs) {
    my $avail = LIMIT - dir_size($dir);
    print "'$dir' has $avail free\n";
}

# Sums the sizes of the plain files directly in $dir (subdirectories are not descended).
sub dir_size {
    my ($dir) = @_;
    my $orig_dir = getcwd;

    chdir $dir or die "Unable to chdir to '$dir': $!";
    opendir(my $dh, '.') or die "Unable to open a dir handle for '$dir': $!";

    my $size = 0;
    while (my $entry = readdir $dh) {
        next unless -f $entry;
        $size += -s $entry; # could use the _ shortcut
    }
    closedir $dh;

    chdir $orig_dir or die "Unable to restore dir to '$orig_dir': $!";
    return $size;
}

If, on the other hand, the packages are all available up front, I too would recommend a "good enough" algorithm rather than hunting for the perfect solution.

Cheers - L~R

Replies are listed 'Best First'.
Re^3: Help with Space Efficency algorithim
by MidLifeXis (Monsignor) on Dec 27, 2007 at 20:26 UTC

    I would agree with your assessment. If the data is known ahead of time, then you can use the standard bin packing solutions.

    If you are getting the new data over time, it may be appropriate to add each new piece of data to the directory whose free space is the smallest amount that is still larger than the size of the new data. I think that this would give the best packing over time.

    my @spaces = map { { dir => $_, size => dir_size($_) } } @dirs;
    my $small_space = (sort by_dir_size grep { larger_than_needed($_) } @spaces)[0];
    store_in_space($newdata, $small_space);

    The code is pseudocode, but it is almost Perl.

    --MidLifeXis

      Your approach works nicely in one respect: it's really easy to comprehend, which makes it appealing from a maintenance perspective. However, it is guaranteed to create relatively significant fragmentation if your average filesize > (binsize / 3). (Not proven, but it seems about right.) If you have a larger average filesize, your best bet to avoid fragmentation is to pick the largest free space that will fit. That will reduce the fragmentation you encounter.

      Another optimization is to use something like DBM::Deep to store the bins and how much space is left in each, so you don't have to recalculate it every time. Furthermore, DBM::Deep is transactional, so multiple processes won't clobber each other.
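      A minimal sketch of both ideas, assuming a hypothetical bins.db file and reusing @dirs and the dir_size() sub from the node above; the actual file move is only hinted at in a comment:

          use strict;
          use warnings;
          use DBM::Deep;

          use constant LIMIT => 4 * 1024 * 1024 * 1000;

          # 'bins.db' and the seeding step are illustrative, not from the thread.
          my $free = DBM::Deep->new(
              file      => 'bins.db',
              locking   => 1,    # flock-based locking for concurrent writers
              autoflush => 1,
          );
          $free->{$_} = LIMIT - dir_size($_) for @dirs;   # seed the cache once

          sub pick_bin {
              my ($need) = @_;
              # Largest free space that still fits the new file.
              my ($bin) = sort { $free->{$b} <=> $free->{$a} }
                          grep { $free->{$_} >= $need } keys %$free;
              return $bin;    # undef if nothing fits
          }

          sub store_file {
              my ($file) = @_;
              my $need = -s $file;
              my $bin  = pick_bin($need) or die "No bin can hold '$file'\n";
              # move $file into $bin here (e.g. with File::Copy::move)
              $free->{$bin} -= $need;    # keep the cached figure current
              return $bin;
          }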


      My criteria for good software:
      1. Does it work?
      2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?

        I need to think through the fragmentation aspect of this. In the extreme, if you choose the largest bucket that your data will fit into, you will always place your data in a new bucket. However, if you place the restriction that you must use an existing bucket if possible, I am not certain (yet) which will work better (select smallest bucket or largest bucket). It has been a while (a few years) since I have looked at this problem.
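        One way to play with the question is to simulate both rules. The sketch below is a toy of my own (none of it is from the thread): it fills a fixed set of bins with random item sizes, once picking the smallest bin that fits and once the largest, and reports how much space is stranded when the first item fails to fit.

            use strict;
            use warnings;
            use List::Util qw(sum);

            use constant BIN_SIZE => 1000;
            use constant NUM_BINS => 20;

            # Fill the bins until an item does not fit anywhere, then report
            # the total stranded free space.  $pick chooses among the bins
            # that can still hold the item.
            sub simulate {
                my ($pick) = @_;
                my @free = (BIN_SIZE) x NUM_BINS;
                while (1) {
                    my $need = 50 + int rand 450;    # arbitrary size range
                    my @fits = grep { $free[$_] >= $need } 0 .. $#free;
                    last unless @fits;
                    my $bin = $pick->(\@free, \@fits);
                    $free[$bin] -= $need;
                }
                return sum(@free);
            }

            my $smallest = sub {    # least free space that still fits
                my ($free, $ok) = @_;
                return (sort { $free->[$a] <=> $free->[$b] } @$ok)[0];
            };
            my $largest = sub {     # most free space
                my ($free, $ok) = @_;
                return (sort { $free->[$b] <=> $free->[$a] } @$ok)[0];
            };

            srand(1); my $small_left = simulate($smallest);
            srand(1); my $large_left = simulate($largest);
            printf "smallest-bucket rule strands %d, largest-bucket rule strands %d\n",
                $small_left, $large_left;

        Reseeding before each run gives both rules the same item stream; a real comparison would average over many seeds.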

        I always welcome thought exercises :)

        --MidLifeXis

Re^3: Help with Space Efficency algorithim
by jkhunt (Initiate) on Dec 28, 2007 at 15:56 UTC
    Thank you MidLifeXis, but all my packages are available. I am moving them from one location to another for space efficiency. The bin packing approach seems like the best fit for my issue. Does anyone have a Perl example of bin packing? I have a hash with packages A through E and their sizes. Can someone help?
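    A minimal first-fit-decreasing sketch, assuming a hypothetical %pkg_size hash (packages A through E mapped to sizes in bytes; substitute the real one) and the per-directory LIMIT used earlier in the thread:

        use strict;
        use warnings;

        use constant LIMIT => 4 * 1024 * 1024 * 1000;   # capacity of one directory

        # Hypothetical sizes; replace with your real hash.
        my %pkg_size = (
            A => 1_200_000_000,
            B =>   900_000_000,
            C => 2_500_000_000,
            D =>   400_000_000,
            E => 3_100_000_000,
        );

        # First fit decreasing: place the largest packages first, each into
        # the first directory that still has room; open a new one if none does.
        my @bins;    # each bin: { free => bytes left, pkgs => [package names] }
        for my $pkg (sort { $pkg_size{$b} <=> $pkg_size{$a} } keys %pkg_size) {
            my $size = $pkg_size{$pkg};
            my ($bin) = grep { $_->{free} >= $size } @bins;
            unless ($bin) {
                $bin = { free => LIMIT, pkgs => [] };
                push @bins, $bin;
            }
            push @{ $bin->{pkgs} }, $pkg;
            $bin->{free} -= $size;
        }

        for my $i (0 .. $#bins) {
            printf "dir%d: %s (%d bytes free)\n",
                $i + 1, join(', ', @{ $bins[$i]{pkgs} }), $bins[$i]{free};
        }

    First fit decreasing is the usual "good enough" choice: it is not guaranteed optimal, but it is simple and known to stay within roughly 11/9 of the optimal number of bins.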