in reply to Re: file chunk algorithm
in thread file chunk algorithm

Great solution, but the only problem is that it eats away at the memory... I keep getting an "out of memory" error, and that is just from creating the filename and size hash... maybe there is still a better way.

I am the first overweight, very tall munchkin... be very amazed.

Re (tilly) 3: file chunk algorithm
by tilly (Archbishop) on Feb 27, 2001 at 03:47 UTC
    Here is an OO solution with a fixed memory limit. The API is a little strange: a bucket is destroyed when it becomes full, and upon destruction you should ensure that its contents get dumped somewhere...
    package Bin;
    use Carp;
    use strict;

    # Add an element to a bucket, returns the head of the list.
    sub add {
        my $self  = shift;
        my $thing = shift;
        my $size  = shift;
        if ($size <= ($self->{max_size} - $self->{size})) {
            # It fits in this bucket.
            $self->{size} += $size;
            push @{$self->{things}}, $thing;
            $self->{skipped} = 0;
        }
        elsif ($size > $self->{max_size}) {
            confess("'$thing' larger than a bucket ($size vs $self->{max_size})");
        }
        else {
            # Doesn't fit here; pass it down the chain. After max_skips
            # consecutive misses this bucket drops out of the chain, so it
            # gets destroyed (and therefore dumped).
            $self->{next} ||= new Bin(@$self{'max_size', 'max_skips'});
            if ($self->{max_skips} < ++$self->{skipped}) {
                return $self->{next}->add($thing, $size);
            }
            else {
                $self->{next} = $self->{next}->add($thing, $size);
            }
        }
        return $self;
    }

    # The dump happens on destruction; the <STDIN> read just pauses
    # between dumps so you can see each bucket in the demo.
    sub DESTROY {
        my $self = shift;
        print "Bucket size $self->{size}: @{$self->{things}}\n";
        <STDIN>;
    }

    sub new {
        my $self = bless {}, shift;
        $self->{max_size}  = shift || 10_000_000;
        $self->{max_skips} = shift || 50;
        $self->{skipped}   = 0;
        $self->{size}      = 0;
        if (@_) {
            $self->add(@_);
        }
        $self;
    }

    package main;

    # Demo: pack a few hundred items into buckets of the default size.
    my $bin = new Bin;
    foreach my $cnt (9_800..10_001) {
        foreach (1, 10, 100) {
            $bin = $bin->add("$cnt\_$_", $cnt * $_);
        }
    }
    undef($bin);
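
    As a usage sketch (my own, not part of tilly's post): one plausible way to feed real files into the Bin class is to walk the tree with File::Find and hand each file's name and size to add(), keeping only the returned head of the bucket chain so that buckets dropped from the head are destroyed, and therefore dumped, as the scan proceeds. Note that DESTROY above waits for a line on STDIN after each dump, and add() will die on any file larger than a single bucket, so for real use you would raise max_size or skip oversized files.

        # Hypothetical driver for the Bin class above; File::Find is core,
        # the wiring is just one way to do it.
        use File::Find;

        my $top_dir = '.';       # assumption: pack everything under the current directory
        my $bin     = new Bin;   # default bucket size of 10_000_000 bytes

        find(sub {
            return unless -f;        # only pack plain files
            my $size = -s _ || 0;    # reuse the stat from -f; empty files count as 0
            # add() returns the current head of the bucket chain; a bucket that
            # falls off the head is destroyed, which is when its contents print.
            $bin = $bin->add($File::Find::name, $size);
        }, $top_dir);

        undef($bin);             # dump whatever buckets remain
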
Re: Re: Re: file chunk algorithm
by MeowChow (Vicar) on Feb 26, 2001 at 22:57 UTC
    Yes, it does eat up memory when run on a large number of files. You may also be hitting an issue with Data::Dumper, which snarfs memory like it's going out of style on huge data structures.

    Try switching over to an "on-line" bin-packing algorithm as shown below (this one is a basic first-fit), but first try replacing just the print Dumper ... with the mydump routine below.

    update: I've confirmed that this was a Data::Dumper issue, at least on my system. When I switched my first response's code to use the mydump routine on my 9 gig drive, with 160,000 files and a $max_bin_size of 1 MB (~10,000 total bins), memory use did not exceed 50 MB. With Data::Dumper doing the dump, well, even 700MB of RAM couldn't stop the inevitable.

    use File::Find;

    my %bins;                              # bin number => { size => bytes, files => [...] }
    my $num_bins     = 0;
    my $max_bin_size = 10*1000*1000;
    my $top_dir      = '.';

    $| = 1;                                # flush the progress output as we go
    find(\&packer, $top_dir);

    # On-line first-fit: put each file into the first existing bin with
    # enough room, or open a new bin if none fits.
    sub packer {
        print "finding: $_\n";
        my $file  = $File::Find::name;
        my $fsize = -s;
        my $bin;
        for (keys %bins) {
            if ($bins{$_}{size} + $fsize < $max_bin_size) {
                $bin = $_;
                last;
            }
        }
        $bin = $num_bins++ if not defined $bin;
        $bins{$bin}{size} += $fsize;
        push @{$bins{$bin}{files}}, $file;
    }

    # Memory-friendly replacement for dumping the whole structure with Data::Dumper.
    sub mydump {
        my $bins = shift;
        while (my ($bnum, $bstruct) = each %$bins) {
            print "Bin Number: $bnum ($bstruct->{size} bytes)\n";
            print "  $_\n" for @{$bstruct->{files}};
        }
    }

    mydump(\%bins);
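
    One design note on the first-fit loop above: it scans every existing bin for every file, so with the ~10,000 bins and 160,000 files from the update that is a lot of hash lookups, even though memory stays flat. A hypothetical variant (my own sketch, not from MeowChow's post; @open_bins and $full_enough are made-up names) is to stop scanning bins once they are nearly full:

        use File::Find;

        my %bins;                                 # bin number => { size => bytes, files => [...] }
        my @open_bins;                            # bin numbers that may still have room
        my $num_bins     = 0;
        my $max_bin_size = 10*1000*1000;
        my $full_enough  = $max_bin_size * 0.99;  # retire bins filled past this point
        my $top_dir      = '.';

        find(\&packer, $top_dir);
        mydump(\%bins);

        sub packer {
            my $file  = $File::Find::name;
            my $fsize = -s;
            my $bin;
            for my $candidate (@open_bins) {      # scan only bins that might still fit something
                if ($bins{$candidate}{size} + $fsize < $max_bin_size) {
                    $bin = $candidate;
                    last;
                }
            }
            if (not defined $bin) {               # nothing fit, open a new bin
                $bin = $num_bins++;
                push @open_bins, $bin;
            }
            $bins{$bin}{size} += $fsize;
            push @{$bins{$bin}{files}}, $file;
            # stop scanning bins that are effectively full
            @open_bins = grep { $bins{$_}{size} < $full_enough } @open_bins;
        }

        sub mydump {
            my $bins = shift;
            while (my ($bnum, $bstruct) = each %$bins) {
                print "Bin Number: $bnum ($bstruct->{size} bytes)\n";
                print "  $_\n" for @{$bstruct->{files}};
            }
        }

    The trade-off is that a retired bin never gets topped up by a later small file, so the packing can end up slightly looser than plain first-fit; the memory profile stays the same, you are just trading a little packing quality for scan time.
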
       MeowChow
       s aamecha.s a..a\u$&owag.print