waswas-fng has asked for the wisdom of the Perl Monks concerning the following question:

I have been struggling with this for a few days now. Every time I think I have a workable algorithm for this I end up seeing a huge flaw. It screams recursive but I just can't seem to figure one out.

Here is the dataset:

Given a set of directories in this layout:

/data/client1/project_type_1/project_1
/data/client1/project_type_1/project_2
/data/client1/project_type_2/project_3
/data/client2/project_type_1/project_4
/data/client2/project_type_1/project_5
/data/client2/project_type_2/project_6
/data/client3/project_type_1/project_7
/data/client3/project_type_1/project_8
/data/client4/project_type_1/project_9
/data/client5/project_type_2/project_10
/data/client5/project_type_2/project_11

Where, under each "project_1"-level directory, the data is unstructured files in an unstructured tree of subdirectories.

Where the largest file size is < 2 gb.

Where the total file size for a "project_1"-like dir can be anywhere from < 50k to > 10 gig.

Where the "project_1" like field do not have to be named unique, but contain unique data and subdirectory structure.

Here is the problem:

Get lists of unique filenames such that each list's total size is > 3.5 gig and < 4.2 gig, while keeping all of a "project_1" together in the same list of filenames, unless that project is itself > 4.2 gig, at which point it should be broken up into as few lists as possible.

Example:
<project>                                  <total size>
/data/client1/project_type_1/project_1     200 mb
/data/client1/project_type_1/project_2     3.1 gb
/data/client1/project_type_2/project_3     1.2 gb
/data/client2/project_type_1/project_4     7.4 gb
/data/client2/project_type_1/project_5     31 mb
/data/client2/project_type_2/project_6     400 mb
/data/client3/project_type_1/project_7     2.5 gb

The lists returned would be
1:
all of the files in
/data/client1/project_type_1/project_1
/data/client1/project_type_1/project_2
/data/client2/project_type_1/project_5
/data/client2/project_type_2/project_6

2:
/data/client1/project_type_2/project_3
/data/client3/project_type_1/project_7

3:
First 4.2 gigs of files in /data/client2/project_type_1/project_4

4:
Remaining files in /data/client2/project_type_1/project_4



I can't seem to figure out (even in pseudo code) how to do this. I think I am just fried from this project.

Anyone have any ideas? If you need more info, let me know...


Thanks!
Waswas

Re: My brain hurts! Recursive directory algorithm..
by jsegal (Friar) on Jul 25, 2002 at 18:39 UTC
    It looks like you are splitting up files to put them on a fixed-size medium (DVD perhaps?). How important is it that your solution be optimal (i.e., that it minimize the number of filesets)? If that is not an important constraint, you could build buckets on a first-come-first-served basis quite easily using File::Find, something like this (warning: untested code):
    use File::Find;
    use File::stat;

    our @current_bucket;
    our $current_size = 0;
    our $BUCKET_SIZE  = 4_000_000_000;

    sub add_to_bucket {
        return unless -f $_;                  # only count plain files
        my $size = stat($_)->size();
        if ($size + $current_size > $BUCKET_SIZE) {
            # bucket is full: process it and start a new one
            process_bucket(\@current_bucket);
            @current_bucket = ();
            $current_size   = 0;
        }
        $current_size += $size;
        push @current_bucket, $File::Find::name;
    }

    sub process_bucket {
        my $bucket = shift;
        # here do things like compress the list to contain
        # parent directories, etc., if you really need to,
        # then print out the list.
    }

    find(\&add_to_bucket, "/data");
    process_bucket(\@current_bucket) if @current_bucket;   # don't forget the last, partial bucket
    That should help you structure your problem....

    If it is important that your solution be optimal, be warned that it is a hard algorithmic problem (it's a form of the partition problem). You could still use File::Find to get the files, but the bucket forming and processing would need to be much more complex (and probably not worth it, though not knowing your precise needs I cannot say for sure...).
    Best of luck..

    --JAS
      Thanks so much, I see the light now. This logic, along with Mike's (RMGir) Algorithm::Bucketizer insight, will get me all the way home.

      I plan on building the total size for each of the "project_1" trees and pushing them into Algorithm::Bucketizer if they are < 4.2 gig. If not, I will jump down one tree level from "project_1" and use a method similar to the one you show to create bucket-like chunks that I can then insert into Algorithm::Bucketizer until I am out of data in the subtree.


      After all that, Algorithm::Bucketizer can optimize the distribution over the buckets with:
      $b->optimize(algorithm => "brute_force");


      With a smaller number of items in the buckets, brute-forcing the "0-1 Knapsack Problem" should not be too time-consuming (thank god it is not fractional =).
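
      A rough sketch of that plan (untested, and assuming Algorithm::Bucketizer's new(bucketsize => ...), add_item, optimize and buckets/items interface; @projects is assumed to already hold the "project_1"-level paths gathered with File::Find):

      use File::Find;
      use Algorithm::Bucketizer;

      my $MAX = int(4.2 * 1024**3);   # 4.2 gig ceiling per list, in bytes

      # total size (in bytes) of everything under a directory
      sub dir_size {
          my $dir  = shift;
          my $size = 0;
          find( sub { $size += -s $_ if -f $_ }, $dir );
          return $size;
      }

      my $b = Algorithm::Bucketizer->new( bucketsize => $MAX );

      for my $project (@projects) {
          my $size = dir_size($project);
          if ( $size <= $MAX ) {
              # small enough: keep the whole project together as a single item
              $b->add_item( $project, $size );
          }
          else {
              # too big: fall back to adding its immediate children separately
              # (this may need to be repeated deeper if a child is still > $MAX)
              opendir my $dh, $project or die "$project: $!";
              for my $entry ( grep { !/^\.\.?$/ } readdir $dh ) {
                  my $path = "$project/$entry";
                  $b->add_item( $path, -d $path ? dir_size($path) : -s $path );
              }
              closedir $dh;
          }
      }

      $b->optimize( algorithm => "brute_force" );

      my $n = 0;
      for my $bucket ( $b->buckets() ) {
          print "List ", ++$n, ":\n";
          print "  $_\n" for $bucket->items();
      }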

      I think this will cover all of my requirements.


      Thanks so much JAS and Mike for pointing me in the right direction! =)

      -Waswas
Re: My brain hurts! Recursive directory algorithm..
by RMGir (Prior) on Jul 25, 2002 at 18:04 UTC
    I think Algorithm::Bucketizer might be a good starting point.

    It doesn't look like it has your "keep things together" constraint, but you might be able to adapt it for your needs.
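
    The basic interface is something like this (untested; adding each whole project as a single item is one way to approximate the "keep things together" part, using sizes from the example above):

    use Algorithm::Bucketizer;

    # one bucket per output list, capped at the 4.2 gig ceiling
    my $b = Algorithm::Bucketizer->new( bucketsize => 4_200_000_000 );

    # each whole project goes in as a single item, so its files stay together
    $b->add_item( "/data/client1/project_type_1/project_1", 200_000_000 );    # ~200 mb
    $b->add_item( "/data/client1/project_type_2/project_3", 1_200_000_000 );  # ~1.2 gb

    for my $bucket ( $b->buckets() ) {
        print join( "\n", $bucket->items() ), "\n\n";
    }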
    --
    Mike

Re: My brain hurts! Recursive directory algorithm..
by Abstraction (Friar) on Jul 25, 2002 at 18:03 UTC
    Have a look at File::Find; it will give you a great place to start.
      Yep, I am familiar with File::Find; I just could not come up with the algorithm to partition the data once I had recursed the structure with File::Find. Thanks for pointing it out -- I am sorry I did not explain that I was planning on using File::Find to generate the initial dataset.

      Thanks!
      -Wade Stuart