mgrangeiro has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone! I was assigned a project at work to write or find an algorithm that will help us back up our data onto DVDs more effectively, by maximizing storage, and if possible more efficiently. Does anybody have the heart and soul of an angel to get me started with resources and perhaps a similar sample script? I'm not looking for someone to write it for me, but a head start would be much appreciated. Peace, Marco

Replies are listed 'Best First'.
Re: I want to maximize data storage in a DVD
by almut (Canon) on Mar 30, 2010 at 18:52 UTC
      Yes almut, I have already looked at them and they are helpful. Do you know of any sites that might have some code samples, by any chance? Thank you!

        There's an example at the bottom of Algorithm::Knapsack. For weights, you'd use the size of the files and for the capacity, you'd use the capacity of a DVD.
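
        A minimal sketch of that mapping (untested, following the pattern in the module's docs; the glob pattern and capacity value below are just placeholders, and my understanding is that each solution comes back as a list of indices into the weights array):

        use strict;
        use warnings;
        use Algorithm::Knapsack;

        my @files    = glob "*.dat";          # hypothetical file list
        my @weights  = map { -s $_ } @files;  # weights = file sizes in bytes
        my $capacity = 4_700_000_000;         # nominal single-layer DVD capacity

        my $knapsack = Algorithm::Knapsack->new(
            capacity => $capacity,
            weights  => \@weights,
        );
        $knapsack->compute();

        # Each solution indexes into @weights (and hence @files).
        for my $solution ($knapsack->solutions()) {
            print join(" ", @files[ @$solution ]), "\n";
        }

        Keep in mind almut's note further down about exact solutions and running time before turning a whole backup set loose on this.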

        Solution:

        [ almut says this doesn't work. See Re^3: I want to maximize data storage in a DVD ]

        use strict;
        use warnings;

        use Algorithm::Knapsack qw( );

        use constant DVD_CAPACITY => ...;

        # Replace this function
        sub process_dvd {
            my ($dvd) = @_;
            use Data::Dumper qw( Dumper );
            local $Data::Dumper::Useqq  = 1;
            local $Data::Dumper::Terse  = 1;
            local $Data::Dumper::Indent = 1;
            print(Dumper($dvd));
        }

        # Picks one DVD's worth of files out of %$items and removes them.
        sub fill_bag {
            my ($bag_size, $items) = @_;

            # One weight entry per file, keyed back to the files by size.
            my @weights;
            for my $size (keys(%$items)) {
                push @weights, ($size) x @{ $items->{$size} };
            }

            return undef if !@weights;

            my $knapsack = Algorithm::Knapsack->new(
                capacity => $bag_size,
                weights  => \@weights,
            );
            $knapsack->compute();

            my $solution = ( $knapsack->solutions() )[0]
                or die("No solution\n");

            # solutions() returns indices into @weights.
            my @bag;
            for my $index (@$solution) {
                my $size = $weights[$index];
                my $item = shift( @{ $items->{$size} } );
                push @bag, $item;
            }

            return \@bag;
        }

        {
            my %files;
            for my $file (...) {
                my $size = -s $file;
                push @{ $files{$size} }, $file;
            }

            while (my $dvd = fill_bag(DVD_CAPACITY, \%files)) {
                process_dvd($dvd);
            }
        }

        Untested.

        Update: Added solution.
        Update: Fixed solution.

        AFAICT (after having played around a bit with the module), Algorithm::Knapsack only seems to compute exact solutions (i.e. ones where the capacity is filled entirely), an approach which might not be practical for a set of real-world file sizes... Also, the computation time gets prohibitively long for file sets like the one shown below.

        So, here's another approach using the related Algorithm::Bucketizer. Note that this module doesn't claim to provide optimal solutions (in the sense of the knapsack problem), but it's most likely good enough for most practical purposes:

        #!/usr/bin/perl
        use strict;
        use warnings;

        use Algorithm::Bucketizer;

        my @filesizes = map { (stat)[7] } glob "8*.pl";
        print "input (file sizes):\n@filesizes\n\n";

        my $b = Algorithm::Bucketizer->new(
            bucketsize => 10_000,
            algorithm  => "retry",
        );

        for my $i (0..$#filesizes) {
            $b->add_item($i, $filesizes[$i]);
        }

        $b->optimize(
            algorithm => "random",
            maxrounds => 1000,
        );

        print "total = file sizes...\n";
        for my $bucket ($b->buckets()) {
            my $total = $bucket->level();
            print "$total = @filesizes[ $bucket->items() ]\n";
        }

        __END__
        input (file sizes):
        361 1304 1988 200 132 715 236 603 674 200 319 319 508 1027 291 154 206 1781 735 2625 424 999 1152 110 43 79 421 181 1240 667 67 1085 179 130 593 2240 2925 182 189 566 414 1833 898 1689 4959 87 513 111 1769 910 262 2129 573 2506 363 325 4044 5298 193 298 195 132 826 229 294 198 103 857 898 172 89 117 11211 693 390 663 482 1523 387 1121 188 1197 283 2530 459 1643 1265 636 146 8735 852 2584 372 781 1257 886 174 176 637 601 466 1327 164 134 962 1340

        total = file sizes...
        9959 = 134 2925 513 1340 466 603 424 910 1327 229 193 43 852
        9972 = 1769 319 67 1689 1121 130 110 2584 283 1304 390 206
        9947 = 667 826 459 1265 674 262 372 508 1240 154 89 236 189 181 1197 781 715 132
        9988 = 8735 387 573 176 117
        9956 = 4044 1781 174 2240 898 87 179 421 132
        9953 = 1152 1257 1643 195 886 663 693 414 198 735 482 566 319 200 79 111 172 188
        9974 = 2129 1988 898 4959
        9987 = 5298 637 1085 2530 291 146
        9942 = 2625 999 1523 593 857 363 2506 182 294
        6149 = 1833 164 601 962 200 325 1027 298 103 636
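
        As a variation on the above (a sketch only; the glob pattern and capacity value are placeholders): if you pass add_item() the filenames themselves instead of array indices, each bucket comes back as a ready-made burn list for one DVD.

        #!/usr/bin/perl
        use strict;
        use warnings;

        use Algorithm::Bucketizer;

        # Placeholder values -- adjust the glob and the capacity for your data.
        my @files    = glob "/data/to/backup/*";
        my $capacity = 4_500_000_000;   # some headroom below the nominal 4.7 GB

        my $b = Algorithm::Bucketizer->new(
            bucketsize => $capacity,
            algorithm  => "retry",
        );

        # Store the filename as the item and its size as the weight.
        for my $file (@files) {
            next unless -f $file;       # skip directories etc. in this sketch
            my $size = -s $file;
            if ($size > $capacity) {
                warn "$file won't fit on a single DVD, skipping\n";
                next;
            }
            $b->add_item($file, $size);
        }

        $b->optimize(algorithm => "random", maxrounds => 1000);

        # Each bucket is one DVD's worth of files.
        my $n = 1;
        for my $bucket ($b->buckets()) {
            printf "DVD %d (%d bytes):\n", $n++, $bucket->level();
            print "  $_\n" for $bucket->items();
        }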
Re: I want to maximize data storage in a DVD
by scorpio17 (Canon) on Mar 31, 2010 at 14:37 UTC

    This isn't a perl solution, but it's how I like to do backups:

    (Assuming you want to backup the /data directory)

    tar cvzf - /data/ | split -d -b 4000m - /backup/backup.tgz

    This creates files with names like backup.tgz00, backup.tgz01, backup.tgz02, etc. Each file is about 4 GB, which fits on a single-layer writable DVD.

    To unpack the archives, just do this:

    cat backup.tgz* | tar xvzf -
Re: I want to maximize data storage in a DVD
by ikegami (Patriarch) on Mar 31, 2010 at 22:47 UTC

    Something that wasn't covered is that you'll need to round each file size up to a multiple of the block size, whatever it is for the medium and file system in question.

    You'll also have to take into consideration that directories take some amount of space.
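
    For example, a hypothetical helper along these lines (assuming 2048-byte blocks, the ISO 9660 sector size; directory and metadata overhead is not accounted for here):

    use strict;
    use warnings;

    # ISO 9660 data sectors are 2048 bytes; adjust for your actual filesystem.
    use constant BLOCK_SIZE => 2048;

    # Round a file's size up to the next multiple of the block size -- this is
    # closer to the space the file will actually occupy on the disc.
    sub allocated_size {
        my ($file) = @_;
        my $size   = -s $file;
        my $blocks = int( ($size + BLOCK_SIZE - 1) / BLOCK_SIZE );
        return $blocks * BLOCK_SIZE;
    }

    print allocated_size($_), "\t$_\n" for @ARGV;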