thealienz1 has asked for the wisdom of the Perl Monks concerning the following question:

Oh magical ones...

Okay, currently I am writing some scripts for a file management site, and I just need some help with an algorithm.

What I am trying to do:

Take a directory of files and sort them into subdirectories of a specified total size. So, if I had 50 one-megabyte (more or less) files and wanted them sorted into 10-megabyte directories, theoretically there should be 10 files in each dir.

What I have so far:

#Used to find files
use File::Find;
#Used to move files to new location
use File::Copy;

#Some defined variables
my $dir_count = 1;        #Keeps count of directories of each chunk
my $total_size = 0;       #Used to count toltal count of file sizes
my $dir_chunk = 10000000; #Size directory should be (aka chunk)

#Starts the actual File Search & Replace
print "Starting Search and Replace - \n";
find (\&eachFile, "/asdf/");
print "Completed\n";

sub eachFile {
    my $filename = $_;
    my $fullpath = $File::Find::name;
    #remember that File::Find changes your CWD,
    #so you can call open with just $_

    if(-f "$fullpath"){
        my $size = -s $fullpath;
        $total_size += $size;
        if($total_size > $dir_chunk) {
            $total_size = $size;
            $cd_count++;
        }
        my $move_dir = "/chunks/$cd_count/";
        if(!(-e "$move_dir")) { mkdir("$move_dir", 0777); }
        print "\t$filename to $move_dir\n";
        move("$fullpath", "$move_dir$filename");
    }
}

My Question:

Is there a more efficient way to do this? My coding is kind of crude, and some of the directories end up being only 8 or 9 megs, when a better algorithm could make sure each one gets as close to 10 megs as possible...

I leave it open for helping me... Thanks

I am the first overweight, very tall munchkin... be very amazed.

Re: file chunk algorithm
by MeowChow (Vicar) on Feb 26, 2001 at 06:01 UTC
    If you manage to come up with an optimal solution, let me know :-)

    What you are doing is equivalent to bin-packing, which is an NP-complete problem. Your code employs a simple "next-fit" algorithm, which is somewhat inefficient, but easy to code. Better heuristics exist, such as first fit, best fit, and first-fit/best-fit decreasing.
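
    To get a feel for the gap, here is a small illustrative sketch (the sizes and the 10-unit cap are invented, not taken from this thread) that packs one list of sizes with next-fit and again with first-fit decreasing:

    use strict;
    use warnings;

    # Made-up "file sizes" and a bin capacity, purely to contrast the heuristics.
    my @sizes = (6, 5, 4, 5, 3, 7);
    my $cap   = 10;

    # Next-fit: only ever consider the current (last) bin.
    sub next_fit {
        my @bins = (0);
        for my $s (@_) {
            if ($bins[-1] + $s > $cap) { push @bins, $s }
            else                       { $bins[-1] += $s }
        }
        return scalar @bins;
    }

    # First-fit decreasing: sort big-to-small, drop each item into the first
    # bin with room, opening a new bin only when nothing fits.
    sub first_fit_decreasing {
        my @bins;
        ITEM: for my $s (sort { $b <=> $a } @_) {
            for my $bin (@bins) {
                if ($bin + $s <= $cap) { $bin += $s; next ITEM }
            }
            push @bins, $s;
        }
        return scalar @bins;
    }

    # Prints "next-fit: 4 bins, first-fit decreasing: 3 bins" for this data.
    printf "next-fit: %d bins, first-fit decreasing: %d bins\n",
        next_fit(@sizes), first_fit_decreasing(@sizes);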

    One good way to do this (first fit decreasing) is to construct a hash of your files and their associated sizes, sort it in decreasing order of size, and pack the biggest items first. Keep track of each bin's free space, and place each item into the first bin in which it fits. Add additional bins only when you can't find a fit.

    Update: Couldn't resist the lure of an algorithm implementation...

    use Data::Dumper;
    use File::Find;

    my %filesize;
    my %bins;
    my $num_bins = 0;
    my $max_bin_size = 10*1000*1000;
    my $top_dir = '.';

    find(sub { $filesize{$File::Find::name} = -s }, $top_dir);

    for my $file (sort { $filesize{$b} <=> $filesize{$a} } keys %filesize) {
        my $fsize = $filesize{$file};
        my $bin;
        for (keys %bins) {
            if ($bins{$_}{size} + $fsize < $max_bin_size) {
                $bin = $_;
                last;
            }
        }
        $bin = $num_bins++ if not defined $bin;
        $bins{$bin}{size} += $fsize;
        push @{$bins{$bin}{files}}, $file;
    }

    print Dumper(\%bins);
       MeowChow                                   
                   s aamecha.s a..a\u$&owag.print

      Great solution, but the only problem is that it eats away at the memory... I keep getting an "out of memory" error, and that is just creating the filename and size hash... maybe there is still a better way.

      I am the first overweight, very tall munchkin... be very amazed.
        Here is an OO solution that has fixed memory limits. The API is a little strange: a bucket is destroyed when it becomes full, and upon destruction you should ensure that it gets dumped somewhere...
        package Bin;
        use Carp;
        use strict;

        # Add an element to a bucket, returns the head of the list.
        sub add {
            my $self = shift;
            my $thing = shift;
            my $size = shift;
            if ($size <= ($self->{max_size} - $self->{size})) {
                $self->{size} += $size;
                push @{$self->{things}}, $thing;
                $self->{skipped} = 0;
            }
            elsif ($size > $self->{max_size}) {
                confess("'$thing' larger than a bucket ($size vs $self->{max_size})");
            }
            else {
                $self->{next} ||= new Bin(@$self{'max_size', 'max_skips'});
                if ($self->{max_skips} < ++$self->{skipped}) {
                    return $self->{next}->add($thing, $size);
                }
                else {
                    $self->{next} = $self->{next}->add($thing, $size);
                }
            }
            return $self;
        }

        sub DESTROY {
            my $self = shift;
            print "Bucket size $self->{size}: @{$self->{things}}\n";
            <STDIN>;
        }

        sub new {
            my $self = bless {}, shift;
            $self->{max_size}  = shift || 10_000_000;
            $self->{max_skips} = shift || 50;
            $self->{skipped} = 0;
            $self->{size} = 0;
            if (@_) {
                $self->add(@_);
            }
            $self;
        }

        package main;

        my $bin = new Bin;
        foreach my $cnt (9_800..10_001) {
            foreach (1, 10, 100) {
                $bin = $bin->add("$cnt\_$_", $cnt * $_);
            }
        }
        undef($bin);
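
        As a purely hypothetical usage note (not part of the post above), the synthetic driver could be replaced with a File::Find walk over the OP's /asdf/ directory; the path, and the idea that a real script would do the actual move inside DESTROY, are assumptions:

        use File::Find;

        # Hypothetical glue: feed real files into the Bin class above instead
        # of the synthetic sizes.  DESTROY is where a real script would move
        # (or archive) the finished bucket's files.
        my $bin = Bin->new(10_000_000, 50);
        find(sub {
            return unless -f;                 # regular files only
            my $size = -s _;                  # reuse the stat from -f
            # Note: Bin::add confesses if one file is bigger than a bucket,
            # so an eval {} guard may be wanted on messy data.
            $bin = $bin->add($File::Find::name, $size);
        }, '/asdf/');
        undef $bin;                           # flush the remaining buckets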
        Yes, it does eat up memory when run on a large number of files. You may also be hitting an issue with Data::Dumper, which snarfs memory like it's going out of style on huge data-structures.

        Try switching over to an "on-line" bin-packing algorithm as shown below (this one is a basic first-fit), but first try replacing just the print Dumper ... with the mydump routine below.

        update: I've confirmed that this was a Data::Dumper issue, at least on my system. When I switched my first response's code to use the mydump routine on my 9 gig drive, with 160,000 files and a $max_bin_size of 1 MB (~10,000 total bins), memory use did not exceed 50 MB. With Data::Dumper doing the dump, well, even 700MB of RAM couldn't stop the inevitable.

        use File::Find;

        my %filesize;
        my %bins;
        my $num_bins = 0;
        my $max_bin_size = 10*1000*1000;
        my $top_dir = '.';

        $| = 1;
        find(\&packer, $top_dir);

        sub packer {
            print "finding: $_\n";
            my $file = $File::Find::name;
            my $fsize = -s;
            my $bin;
            for (keys %bins) {
                if ($bins{$_}{size} + $fsize < $max_bin_size) {
                    $bin = $_;
                    last;
                }
            }
            $bin = $num_bins++ if not defined $bin;
            $bins{$bin}{size} += $fsize;
            push @{$bins{$bin}{files}}, $file;
        }

        sub mydump {
            my $bins = shift;
            while (my ($bnum, $bstruct) = each %$bins) {
                print "Bin Number: $bnum ($bstruct->{size} bytes)\n";
                print " $_\n" for @{$bstruct->{files}};
            }
        }

        print mydump(\%bins);
           MeowChow                                   
                       s aamecha.s a..a\u$&owag.print
Re: file chunk algorithm (a GA way)
by Malkavian (Friar) on Feb 26, 2001 at 15:41 UTC
    A node that just seems to beg for attention on this is the one on Genetic Algorithms.
    If the best fit of all files is your grail, rather than speed, GAs look like a good candidate for this.
    As an initial thought on applying the GA:
    Each section of the gene string represents a file; the length of each allele (effectively the section of a gene that equates to a rule) is determined by the number of directories you have, and it is, in essence, a binary encoding of a directory 'number'.
    Thus, if you had 50 files to put into 10 directories, you'd have an allele size of 4 bits (the minimum binary length needed to encode 10 directories) and a total gene length of 4*50 bits (one allele for each file).
    A test of fitness would be that each directory must be under 10MB (if you absolutely need it under 10 meg), so a BIG penalty needs to be applied for each directory going over 10 meg.
    The reason for a big penalty rather than complete invalidation is that much of the gene may still be useful, even though it should never appear in the final solution. High-scoring genes with one bum directory should still be candidates for crossover.
    Then work out a scoring system for deviation from the 10MB mark that achieves the best fit you're looking for (some solutions will pack some directories to 10MB and leave one or two with a lot of space; others will lead to most directories being nearly packed; take your pick).
    Populate the gene pool, and let it rip. :)
    Leave it running for as long as you feel necessary (keep an eye on the total fitness of the system, and get it to stop when you think it's good enough, or you're tired of watching it run); a few thousand generations should be good.
    That should give you a pretty decent solution to your problem.
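    As a rough, non-authoritative sketch of the scoring idea above (the 4-bit decoding, the modulo handling of unused codes, the penalty weights, and the random file sizes are all invented for illustration):

    use strict;
    use warnings;

    # 50 pretend files of roughly 1MB each, and the 10MB directory limit.
    my @file_sizes = map { 1_000_000 + int rand 500_000 } 1 .. 50;
    my $max_dir    = 10_000_000;
    my $num_dirs   = 10;

    # Score one gene: a bit string with one 4-bit allele per file naming the
    # directory that file is assigned to.  Higher (closer to zero) is fitter.
    sub fitness {
        my ($gene) = @_;
        my @totals = (0) x $num_dirs;
        for my $i (0 .. $#file_sizes) {
            # 4 bits encode 0..15; fold the unused codes back into 0..9.
            my $dir = oct('0b' . substr($gene, $i * 4, 4)) % $num_dirs;
            $totals[$dir] += $file_sizes[$i];
        }
        my $score = 0;
        for my $t (@totals) {
            if ($t > $max_dir) {
                $score -= 10 * ($t - $max_dir);   # BIG penalty, not invalidation
            }
            else {
                $score -= $max_dir - $t;          # mild penalty for wasted space
            }
        }
        return $score;
    }

    # Example: score one random gene of 50 alleles * 4 bits each.
    my $gene = join '', map { int rand 2 } 1 .. 4 * @file_sizes;
    print fitness($gene), "\n";
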
    For more info, check out: geneticprogramming.com.
    Hope this helps some,

    Malk
Re: file chunk algorithm
by thealienz1 (Pilgrim) on Feb 26, 2001 at 05:05 UTC

    Oye, I saw something wrong with my code, sorry... here is the fixed version...

    #Used to find files
    use File::Find;
    #Used to move files to new location
    use File::Copy;

    #Some defined variables
    my $dir_count = 1;        #Keeps count of directories of each chunk
    my $total_size = 0;       #Used to count the total size of files seen
    my $dir_chunk = 10000000; #Size directory should be (aka chunk)

    #Starts the actual File Search & Replace
    print "Starting Search and Replace - \n";
    find (\&eachFile, "/asdf/");
    print "Completed\n";

    sub eachFile {
        my $filename = $_;
        my $fullpath = $File::Find::name;
        #remember that File::Find changes your CWD,
        #so you can call open with just $_

        if(-f "$fullpath"){
            my $size = -s $fullpath;
            $total_size += $size;
            if($total_size > $dir_chunk) {
                $total_size = $size;
                $dir_count++;
            }
            my $move_dir = "/chunks/$dir_count/";
            if(!(-e "$move_dir")) { mkdir("$move_dir", 0777); }
            print "\t$filename to $move_dir\n";
            move("$fullpath", "$move_dir$filename");
        }
    }

    I am the first overweight, very tall munchkin... be very amazed.