in reply to Re: file chunk algorithm
in thread file chunk algorithm

Great solution, but the only problem is that it eats away at the memory... I keep getting an "out of memory" error, and that is just from creating the filename and size hash... maybe there is still a better way.

I am the first overweight, very tall munchkin... be very amazed.

Re (tilly) 3: file chunk algorithm
by tilly (Archbishop) on Feb 27, 2001 at 03:47 UTC
    Here is an OO solution with a fixed memory limit. The API is a little strange: a bucket is destroyed when it becomes full, and upon destruction you should ensure that its contents get dumped somewhere...
    package Bin;
    use Carp;
    use strict;

    # Add an element to a bucket, returns the head of the list.
    sub add {
        my $self  = shift;
        my $thing = shift;
        my $size  = shift;
        if ($size <= ($self->{max_size} - $self->{size})) {
            # It fits in this bucket.
            $self->{size} += $size;
            push @{$self->{things}}, $thing;
            $self->{skipped} = 0;
        }
        elsif ($size > $self->{max_size}) {
            confess("'$thing' larger than a bucket ($size vs $self->{max_size})");
        }
        else {
            # Doesn't fit here; pass it down the chain. After max_skips
            # consecutive misses this bucket drops out of the chain, so it
            # gets destroyed (and therefore dumped).
            $self->{next} ||= new Bin(@$self{'max_size', 'max_skips'});
            if ($self->{max_skips} < ++$self->{skipped}) {
                return $self->{next}->add($thing, $size);
            }
            else {
                $self->{next} = $self->{next}->add($thing, $size);
            }
        }
        return $self;
    }

    # The dump happens on destruction; the <STDIN> read just pauses
    # between dumps so you can see each bucket in the demo.
    sub DESTROY {
        my $self = shift;
        print "Bucket size $self->{size}: @{$self->{things}}\n";
        <STDIN>;
    }

    sub new {
        my $self = bless {}, shift;
        $self->{max_size}  = shift || 10_000_000;
        $self->{max_skips} = shift || 50;
        $self->{skipped}   = 0;
        $self->{size}      = 0;
        if (@_) {
            $self->add(@_);
        }
        $self;
    }

    package main;

    # Demo: pack a few hundred items into buckets of the default size.
    my $bin = new Bin;
    foreach my $cnt (9_800..10_001) {
        foreach (1, 10, 100) {
            $bin = $bin->add("$cnt\_$_", $cnt * $_);
        }
    }
    undef($bin);
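
    As a usage sketch (my own, not part of tilly's post): one plausible way to feed real files into the Bin class is to walk the tree with File::Find and hand each file's name and size to add(), keeping only the returned head of the bucket chain so that buckets dropped from the head are destroyed, and therefore dumped, as the scan proceeds. Note that DESTROY above waits for a line on STDIN after each dump, and add() will die on any file larger than a single bucket, so for real use you would raise max_size or skip oversized files.

        # Hypothetical driver for the Bin class above; File::Find is core,
        # the wiring is just one way to do it.
        use File::Find;

        my $top_dir = '.';       # assumption: pack everything under the current directory
        my $bin     = new Bin;   # default bucket size of 10_000_000 bytes

        find(sub {
            return unless -f;        # only pack plain files
            my $size = -s _ || 0;    # reuse the stat from -f; empty files count as 0
            # add() returns the current head of the bucket chain; a bucket that
            # falls off the head is destroyed, which is when its contents print.
            $bin = $bin->add($File::Find::name, $size);
        }, $top_dir);

        undef($bin);             # dump whatever buckets remain
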
Re: Re: Re: file chunk algorithm
by MeowChow (Vicar) on Feb 26, 2001 at 22:57 UTC
    Yes, it does eat up memory when run on a large number of files. You may also be hitting an issue with Data::Dumper, which snarfs memory like it's going out of style on huge data structures.

    Try switching over to an "on-line" bin-packing algorithm as shown below (this one is a basic first-fit), but first try replacing just the print Dumper ... with the mydump routine below.

    update: I've confirmed that this was a Data::Dumper issue, at least on my system. When I switched my first response's code to use the mydump routine on my 9 gig drive, with 160,000 files and a $max_bin_size of 1 MB (~10,000 total bins), memory use did not exceed 50 MB. With Data::Dumper doing the dump, well, even 700MB of RAM couldn't stop the inevitable.

    use File::Find;

    my %bins;                              # bin number => { size => bytes, files => [...] }
    my $num_bins     = 0;
    my $max_bin_size = 10*1000*1000;
    my $top_dir      = '.';

    $| = 1;                                # flush the progress output as we go
    find(\&packer, $top_dir);

    # On-line first-fit: put each file into the first existing bin with
    # enough room, or open a new bin if none fits.
    sub packer {
        print "finding: $_\n";
        my $file  = $File::Find::name;
        my $fsize = -s;
        my $bin;
        for (keys %bins) {
            if ($bins{$_}{size} + $fsize < $max_bin_size) {
                $bin = $_;
                last;
            }
        }
        $bin = $num_bins++ if not defined $bin;
        $bins{$bin}{size} += $fsize;
        push @{$bins{$bin}{files}}, $file;
    }

    # Memory-friendly replacement for dumping the whole structure with Data::Dumper.
    sub mydump {
        my $bins = shift;
        while (my ($bnum, $bstruct) = each %$bins) {
            print "Bin Number: $bnum ($bstruct->{size} bytes)\n";
            print "  $_\n" for @{$bstruct->{files}};
        }
    }

    mydump(\%bins);
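
    One design note on the first-fit loop above: it scans every existing bin for every file, so with the ~10,000 bins and 160,000 files from the update that is a lot of hash lookups, even though memory stays flat. A hypothetical variant (my own sketch, not from MeowChow's post; @open_bins and $full_enough are made-up names) is to stop scanning bins once they are nearly full:

        use File::Find;

        my %bins;                                 # bin number => { size => bytes, files => [...] }
        my @open_bins;                            # bin numbers that may still have room
        my $num_bins     = 0;
        my $max_bin_size = 10*1000*1000;
        my $full_enough  = $max_bin_size * 0.99;  # retire bins filled past this point
        my $top_dir      = '.';

        find(\&packer, $top_dir);
        mydump(\%bins);

        sub packer {
            my $file  = $File::Find::name;
            my $fsize = -s;
            my $bin;
            for my $candidate (@open_bins) {      # scan only bins that might still fit something
                if ($bins{$candidate}{size} + $fsize < $max_bin_size) {
                    $bin = $candidate;
                    last;
                }
            }
            if (not defined $bin) {               # nothing fit, open a new bin
                $bin = $num_bins++;
                push @open_bins, $bin;
            }
            $bins{$bin}{size} += $fsize;
            push @{$bins{$bin}{files}}, $file;
            # stop scanning bins that are effectively full
            @open_bins = grep { $bins{$_}{size} < $full_enough } @open_bins;
        }

        sub mydump {
            my $bins = shift;
            while (my ($bnum, $bstruct) = each %$bins) {
                print "Bin Number: $bnum ($bstruct->{size} bytes)\n";
                print "  $_\n" for @{$bstruct->{files}};
            }
        }

    The trade-off is that a retired bin never gets topped up by a later small file, so the packing can end up slightly looser than plain first-fit; the memory profile stays the same, you are just trading a little packing quality for scan time.
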
       MeowChow
       s aamecha.s a..a\u$&owag.print