It looks like you are splitting up files to put on a fixed-size medium (DVD perhaps?). How important is it that your solution be optimal (i.e., minimizing the number of filesets)? If that is not an important constraint, you could build buckets on a first-come, first-served basis quite easily using File::Find, something like this (warning: untested code):
use strict;
use warnings;
use File::Find;
use File::stat;

our @current_bucket;
our $current_size = 0;
our $BUCKET_SIZE  = 4_000_000_000;

sub add_to_bucket {
    return unless -f $_;            # only count plain files
    my $size = stat($_)->size();
    if ($size + $current_size > $BUCKET_SIZE) {
        # bucket is full: hand it off and start a new one
        process_bucket(\@current_bucket);
        @current_bucket = ();
        $current_size   = 0;
    }
    $current_size += $size;
    push @current_bucket, $File::Find::name;
}

sub process_bucket {
    my $bucket = shift;
    # here do things like compress the list to contain
    # parent directories, etc., if you really need to,
    # then print out the list.
}

find(\&add_to_bucket, "/data");
process_bucket(\@current_bucket) if @current_bucket;   # flush the last, partial bucket
That should help you structure your problem....
If it is important that your solution be optimal, be warned that it is a hard algorithmic problem (it's a form of the bin packing problem, a close relative of the partition problem). You could still use File::Find to get the files, but the bucket forming and processing would need to be much more complex (and probably not worth it, though not knowing your precise needs I cannot say for sure...).
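For a rough idea, one textbook heuristic is first-fit decreasing: sort items largest-first and drop each into the first bucket it fits in. An untested sketch (it assumes @files has already been gathered as [ name, size ] pairs, e.g. via File::Find, and it is only a near-optimal heuristic, not an exact solution):

my $BUCKET_SIZE = 4_000_000_000;
my @buckets;    # each bucket is { size => total_bytes, files => [names] }

# @files is assumed to hold [ $name, $size ] pairs collected beforehand
for my $file (sort { $b->[1] <=> $a->[1] } @files) {
    my ($name, $size) = @$file;
    my $placed = 0;
    for my $bucket (@buckets) {
        if ($bucket->{size} + $size <= $BUCKET_SIZE) {
            push @{ $bucket->{files} }, $name;
            $bucket->{size} += $size;
            $placed = 1;
            last;
        }
    }
    # no existing bucket had room, so open a new one
    push @buckets, { size => $size, files => [$name] } unless $placed;
}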
Best of luck..
--JAS
Thanks so much, I see the light now. This logic along with Mike's (RMGir) Algorithm::Bucketizer insight will get me all the way home.
I plan on building the total size for each of the "project_1" trees and pushing them into Algorithm::Bucketizer if they are < 4.2 gig. If not, I will jump down one tree level from "project_1" and use a method similar to the one you show to create a bucket-like object that I can then insert into Algorithm::Bucketizer until I am out of data in the subtree.
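Roughly, the per-tree totals would be gathered with something like this (untested; the /data/project_* glob is just a placeholder for my real layout):

use File::Find;
use File::stat;

# Untested sketch: total up each project tree so it can be offered to
# Algorithm::Bucketizer as a single item. The glob is a placeholder.
my %tree_size;
for my $tree (glob "/data/project_*") {
    find(
        sub {
            return unless -f $_;
            $tree_size{$tree} += stat($_)->size();
        },
        $tree
    );
}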
After all that, Algorithm::Bucketizer can optimize the distribution over the buckets with:
$b->optimize(algorithm => "brute_force");
With a smaller number of items in the buckets, brute-forcing the 0-1 knapsack problem should not be too time consuming (thank god it is not fractional =).
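Wired together, that part should come out roughly like this (untested; %item_size stands in for whatever hash ends up holding the tree/subtree totals, and the calls are Algorithm::Bucketizer's documented new/add_item/optimize/buckets interface):

use Algorithm::Bucketizer;

# Untested sketch: each key of %item_size is a project tree (or a
# subtree, for trees over the limit) and each value is its total size.
my $b = Algorithm::Bucketizer->new(bucketsize => 4_200_000_000);

while (my ($item, $size) = each %item_size) {
    $b->add_item($item, $size);
}

$b->optimize(algorithm => "brute_force");

for my $bucket ($b->buckets()) {
    print "Volume ", $bucket->serial(), ":\n";
    print "  $_\n" for $bucket->items();
}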
I think this will cover all of my requirements.
Thanks so much JAS and Mike for pointing me in the right direction! =)
-Waswas
I think Algorithm::Bucketizer might be a good starting point.
It doesn't look like it has your "keep things together" constraint, but you might be able to adapt it for your needs.
--
Mike
Have a look at File::Find; it will give you a great place to start.
Yep, I am familiar with File::Find; I just could not come up with the algorithm to partition the data once I had recursed the structure with File::Find. Thanks for pointing it out -- I am sorry I did not explain that I was planning on using File::Find to generate the initial dataset.
Thanks!
-Wade Stuart