in reply to file chunk algorithm

If you manage to come up with an optimal solution, let me know :-)

What you are doing is equivalent to bin packing, which is an NP-complete problem. Your code employs a simple "next-fit" algorithm, which is somewhat inefficient but easy to code. Better heuristics exist, such as first-fit, best-fit, and their "decreasing" variants (first-fit decreasing and best-fit decreasing).
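For reference, here is a minimal sketch of the next-fit approach, under the same assumptions as the code further down (a %filesize hash built with File::Find and a 10 MB bin size); next-fit only ever looks at the most recently opened bin, which is why it tends to waste more space than first fit:

use File::Find;

my %filesize;
my $max_bin_size = 10*1000*1000;
find(sub { $filesize{$File::Find::name} = -s if -f }, '.');

# Next-fit: only the most recently opened bin is ever considered;
# a fresh bin is opened as soon as a file doesn't fit.
my @bins = ({ size => 0, files => [] });
for my $file (keys %filesize) {
    my $fsize = $filesize{$file};
    if ($bins[-1]{size} + $fsize > $max_bin_size) {
        push @bins, { size => 0, files => [] };
    }
    $bins[-1]{size} += $fsize;
    push @{ $bins[-1]{files} }, $file;
}
printf "bin %d: %d bytes, %d files\n",
    $_, $bins[$_]{size}, scalar @{ $bins[$_]{files} } for 0 .. $#bins;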

One good way to do this (first-fit decreasing) is to construct a hash of your files and their associated sizes, sort it in decreasing order of size, and pack the biggest items first. Keep track of each bin's free space, and place each item into the first bin that has room for it. Add a new bin only when no existing bin can hold the item.

Update: Couldn't resist the lure of an algorithm implementation...

use Data::Dumper;
use File::Find;

my %filesize;
my %bins;
my $num_bins = 0;
my $max_bin_size = 10*1000*1000;
my $top_dir = '.';

find(sub { $filesize{$File::Find::name} = -s }, $top_dir);

for my $file (sort { $filesize{$b} <=> $filesize{$a} } keys %filesize) {
    my $fsize = $filesize{$file};
    my $bin;
    for (keys %bins) {
        if ($bins{$_}{size} + $fsize < $max_bin_size) {
            $bin = $_;
            last;
        }
    }
    $bin = $num_bins++ if not defined $bin;
    $bins{$bin}{size} += $fsize;
    push @{$bins{$bin}{files}}, $file;
}

print Dumper(\%bins);
   MeowChow

Re: Re: file chunk algorithm
by thealienz1 (Pilgrim) on Feb 26, 2001 at 20:32 UTC

    Great solution, but the only problem is that it eats away at memory... I keep getting an "Out of memory" error, and that is just from building the filename-and-size hash... maybe there is still a better way.

      Here is an OO solution with a fixed memory footprint. The API is a little strange: a bucket is destroyed once it fills up (i.e. has been skipped too many times), and on destruction you should make sure its contents get dumped somewhere...
      package Bin;
      use Carp;
      use strict;

      # Add an element to a bucket; returns the (possibly new) head of the list.
      sub add {
          my $self  = shift;
          my $thing = shift;
          my $size  = shift;
          if ($size <= ($self->{max_size} - $self->{size})) {
              # It fits in this bucket.
              $self->{size} += $size;
              push @{$self->{things}}, $thing;
              $self->{skipped} = 0;
          }
          elsif ($size > $self->{max_size}) {
              confess("'$thing' larger than a bucket ($size vs $self->{max_size})");
          }
          else {
              # Doesn't fit here; pass it along the chain.
              $self->{next} ||= new Bin(@$self{'max_size', 'max_skips'});
              if ($self->{max_skips} < ++$self->{skipped}) {
                  # Skipped too often: drop this bucket from the head of the
                  # list so it gets destroyed (and its contents dumped).
                  return $self->{next}->add($thing, $size);
              }
              else {
                  $self->{next} = $self->{next}->add($thing, $size);
              }
          }
          return $self;
      }

      sub DESTROY {
          my $self = shift;
          print "Bucket size $self->{size}: @{$self->{things}}\n";
          <STDIN>;   # pause so each dumped bucket can be inspected
      }

      sub new {
          my $self = bless {}, shift;
          $self->{max_size}  = shift || 10_000_000;
          $self->{max_skips} = shift || 50;
          $self->{skipped}   = 0;
          $self->{size}      = 0;
          if (@_) {
              $self->add(@_);
          }
          $self;
      }

      package main;

      my $bin = new Bin;
      foreach my $cnt (9_800 .. 10_001) {
          foreach (1, 10, 100) {
              $bin = $bin->add("$cnt\_$_", $cnt * $_);
          }
      }
      undef($bin);
      Yes, it does eat up memory when run on a large number of files. You may also be running into an issue with Data::Dumper, which snarfs memory like it's going out of style on huge data structures.

      Try switching over to an "on-line" bin-packing algorithm as shown below (this one is a basic first-fit), but first try replacing just the print Dumper ... with the mydump routine below.

      Update: I've confirmed that this was a Data::Dumper issue, at least on my system. When I switched my first response's code to use the mydump routine on my 9 GB drive, with 160,000 files and a $max_bin_size of 1 MB (~10,000 total bins), memory use did not exceed 50 MB. With Data::Dumper doing the dump, well, even 700 MB of RAM couldn't stop the inevitable.
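      If you still want Dumper-style output, one possible workaround (the dump_bins routine here is just an illustration, assuming the same %bins structure as above) is to serialize one bin per call, so Data::Dumper never has to build the output for the entire structure as a single huge string:

      use Data::Dumper;

      sub dump_bins {
          my $bins = shift;
          local $Data::Dumper::Indent = 1;   # less whitespace than the default
          for my $bnum (sort { $a <=> $b } keys %$bins) {
              # One Dump call per bin keeps each generated string small.
              print Data::Dumper->Dump([ $bins->{$bnum} ], [ "bin_$bnum" ]);
          }
      }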

      use File::Find;

      my %bins;
      my $num_bins = 0;
      my $max_bin_size = 10*1000*1000;
      my $top_dir = '.';

      $| = 1;
      find(\&packer, $top_dir);

      # On-line first-fit: pack each file into the first existing bin with
      # room for it, as File::Find hands it to us.
      sub packer {
          print "finding: $_\n";
          my $file  = $File::Find::name;
          my $fsize = -s;
          my $bin;
          for (keys %bins) {
              if ($bins{$_}{size} + $fsize < $max_bin_size) {
                  $bin = $_;
                  last;
              }
          }
          $bin = $num_bins++ if not defined $bin;
          $bins{$bin}{size} += $fsize;
          push @{$bins{$bin}{files}}, $file;
      }

      # Lightweight replacement for the Data::Dumper call.
      sub mydump {
          my $bins = shift;
          while (my ($bnum, $bstruct) = each %$bins) {
              print "Bin Number: $bnum ($bstruct->{size} bytes)\n";
              print "    $_\n" for @{$bstruct->{files}};
          }
      }

      mydump(\%bins);
         MeowChow