Jack B. Nymbol has asked for the wisdom of the Perl Monks concerning the following question:

I have a list of 100,000 files with file sizes
size    filename
-----   --------
4329    file1
12311   file2
...
657     file100000
If I calculate the average file size of the list as it is, it may be too high or too low for the target average I'm seeking. What I am trying to do is generate a new list that matches a target average file size (or is within ~500 bytes of it).
./prog 25000 file.lst > new.25000.lst
One idea I had was to generate buckets of files at a resolution of 1024 bytes and then create the new list by plucking files from there, chucking some from the new list to meet the efficiency goal (see below). I would also like to maintain some randomness in the file sizes; I don't want clumping near the target average or anything like that. Fast would be good, but efficient is more important, i.e. efficient = keep as many of the elements of the original list as possible.
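
A minimal sketch of the bucketing step only (the plucking and chucking logic is left open), assuming file.lst holds "size filename" lines like the table above:

    use strict;
    use warnings;

    my %bucket;    # bucket index (1 KiB resolution) => list of file names
    open my $fh, '<', 'file.lst' or die "file.lst: $!";
    while (<$fh>) {
        my ( $size, $name ) = split;
        next unless defined $size && $size =~ /^\d+$/;    # skip header lines
        push @{ $bucket{ int( $size / 1024 ) } }, $name;
    }
    printf "bucket %6d KiB: %d files\n", $_, scalar @{ $bucket{$_} }
        for sort { $a <=> $b } keys %bucket;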

Replies are listed 'Best First'.
Re: Creating a sub-list constrained by an average while maintaining maximum number of original elements
by CountZero (Bishop) on Sep 01, 2010 at 06:22 UTC
    That is not so much a Perl question as one of finding an algorithm, but still very interesting.

    My take on that would be:

    1. Calculate the average of the whole list: if you are near the target average, you're done; if not, go to step 2.
    2. Calculate how many bytes you must "gain" or "lose" to reach the target average, find a file which is as close as possible to that figure, and eliminate that file (to "gain" bytes on your average you must eliminate a file with fewer bytes than your average, so the average increases). Note: to calculate the "new" average, remember that you now have one file less in your list! After eliminating that file, call the new list your "whole" list and feed it to step 1.
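
    A minimal sketch of that loop (my reading of the two steps above, not tested code); %size, $target and $margin are hypothetical stand-ins for the real file list, target average and ~500-byte tolerance:

    use strict;
    use warnings;
    use List::Util qw( sum );

    my %size   = map { sprintf( 'file%05d', $_ ) => int rand 50_000 } 1 .. 10_000;
    my $target = 25_000;
    my $margin = 500;

    my $sum   = sum( values %size );
    my $count = scalar keys %size;
    my $gap   = abs( $sum / $count - $target );

    while ( $count > 1 && $gap > $margin ) {
        # Removing a file of size x leaves an average of ($sum - x) / ($count - 1),
        # so the ideal file to eliminate has size $sum - $target * ($count - 1).
        my $ideal    = $sum - $target * ( $count - 1 );
        my ($victim) = sort {
            abs( $size{$a} - $ideal ) <=> abs( $size{$b} - $ideal )
        } keys %size;
        $sum -= delete $size{$victim};
        $count--;

        my $new_gap = abs( $sum / $count - $target );
        last if $new_gap >= $gap;    # give up once we stop converging
        $gap = $new_gap;
    }

    printf "Kept %d files, average %.1f (target %d)\n", $count, $sum / $count, $target;

    Re-sorting all keys on every pass is wasteful; with 100,000 files a pre-sorted array plus a binary search for $ideal would be the obvious speed-up.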

    For sure this algorithm is not guaranteed to provide a solution, but provided your list is big enough and the lengths of the files are sufficiently well spread out, it should not be too bad (for various appropriate definitions of "big enough", "sufficiently" and "not too bad").

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      CountZero...

      I am truly humbled by your reply to the OP. Your response was quick, and eerily accurate. I understood what he was asking only after you had replied.

      I am dazzled by your ability to so rapidly dissect such a problem into its parts, and then provide an easy-to-understand description of the mathematical formula.

      Steve

        Thank you for this praise, but really it is what I was trained to do.

        Yes, I am a lawyer (not a trained programmer), and as lawyers we are trained to analyse a problem (most of the time based on incomplete or even wrong data), then synthesize a solution and present that solution in language (the dreaded "legalese").

        It is not that different from what an analyst/programmer does!

        CountZero

        "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Creating a sub-list constrained by an average while maintaining maximum number of original elements
by moritz (Cardinal) on Sep 01, 2010 at 08:54 UTC
    Not a solution, just an unfinished thought: maybe you can abuse Algorithm::Bucketizer or Algorithm::BinPack for your purposes - or take a look at their optimization algorithms and adapt them to your needs.
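
    For reference, the basic Algorithm::Bucketizer interface (from memory; do check the module's POD) looks roughly like the sketch below. How to map fixed-size buckets onto the target-average selection is exactly the unfinished part; the bucket size and file data here are made up.

    use strict;
    use warnings;
    use Algorithm::Bucketizer;

    # Pack made-up files into buckets holding at most 25_000_000 bytes each.
    my $bucketizer = Algorithm::Bucketizer->new( bucketsize => 25_000_000 );
    $bucketizer->add_item( "file$_", int rand 50_000 ) for 1 .. 1000;

    for my $bucket ( $bucketizer->buckets ) {
        my @items = $bucket->items;
        printf "bucket holds %d files\n", scalar @items;
    }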

    The problems are related, though not quite the same.

    Perl 6 - links to (nearly) everything that is Perl 6.
Re: Creating a sub-list constrained by an average while maintaining maximum number of original elements
by BrowserUk (Patriarch) on Sep 01, 2010 at 11:13 UTC

    Discarding the required number of the largest (or smallest) files is the only way to achieve your "efficiency" definition. This takes a second or two for random lists of 100,000 files (it avoids recalculating the average from scratch):

    #! perl -slw
    use strict;
    use List::Util qw[ sum ];
    use Math::Random::MT qw[ rand srand ];

    our $MAX    //= 2**24;
    our $N      //= 1e5;
    our $S      //= 1;
    our $TARGET //= 5e6;

    srand( $S );

    my $nameRoot = 'fileAAAAA';
    my %files = map { $nameRoot++ => int( rand $MAX ) } 1 .. $N;

    my @namesByValue  = sort { $files{ $a } <=> $files{ $b } } keys %files;
    my @orderedValues = @files{ @namesByValue };

    my $ave = sum( @orderedValues ) / @orderedValues;
    print "Original average: ", $ave;

    my @namesToDiscard;
    if( $ave > $TARGET ) {      ## Lose some big ones.
        while( $ave > ( $TARGET + 500 ) ) {
            push @namesToDiscard, pop @namesByValue;
            my $change = ( pop( @orderedValues ) - $ave ) / @orderedValues;
            $ave -= $change;
        }
    }
    else {                      ## Lose some small ones.
        while( $ave < ( $TARGET + 500 ) ) {
            push @namesToDiscard, shift @namesByValue;
            my $change = ( $ave - shift @orderedValues ) / @orderedValues;
            $ave += $change;
        }
    }

    printf "TARGET: $TARGET; Achieved ave:$ave (using %d files)\n",
        scalar @orderedValues;

    delete @files{ @namesToDiscard };

    __END__
    c:\test>858276 -TARGET=5e6
    Original average: 8374785.96135
    TARGET: 5e6; Achieved ave:5000496.12061175 (using 59762 files)

    c:\test>858276 -TARGET=10e6
    Original average: 8374785.96135
    TARGET: 10e6; Achieved ave:10000541.8121829 (using 80621 files)

    c:\test>858276 -TARGET=7.5e6
    Original average: 8374785.96135
    TARGET: 7.5e6; Achieved ave:7500462.85185227 (using 89593 files)

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      Changing the Tavg and Tmargin of course affects the efficiency, but the effect is minor.
      Original average: 29521.3619337317 for 92050 files
      Target avg.: 15000; Achieved ave: 15979.89 (using 92017 files)
      Efficiency: 99.96% (92050 originals, 92017 finals)
      Thanks, perlmonks is a great place.

        Glad to have helped.

        You may realise this, but it's worth stating in case you don't: the algorithm I used is in no way guaranteed to find the optimum solution, only a very good one very quickly.

        The problem with finding a better solution is that there is no discernible pattern in the relationship or ordering of the averages generated from even the ordered subsets of the set.

        That last statement is best illustrated by calculating all the averages of all the possible subsets of a small set:

        c:\test>858276-2
        {
          1           => 1,
          "1 2"       => "1.5",
          "1 2 3"     => 2,
          "1 2 3 4"   => "2.5",
          "1 2 3 4 5" => 3,
          "1 2 3 5"   => "2.75",
          "1 2 4"     => "2.33333333333333",
          "1 2 4 5"   => 3,
          "1 2 5"     => "2.66666666666667",
          "1 3"       => 2,
          "1 3 4"     => "2.66666666666667",
          "1 3 4 5"   => "3.25",
          "1 3 5"     => 3,
          "1 4"       => "2.5",
          "1 4 5"     => "3.33333333333333",
          "1 5"       => 3,
          2           => 2,
          "2 3"       => "2.5",
          "2 3 4"     => 3,
          "2 3 4 5"   => "3.5",
          "2 3 5"     => "3.33333333333333",
          "2 4"       => 3,
          "2 4 5"     => "3.66666666666667",
          "2 5"       => "3.5",
          3           => 3,
          "3 4"       => "3.5",
          "3 4 5"     => 4,
          "3 5"       => 4,
          4           => 4,
          "4 5"       => "4.5",
          5           => 5,
        }
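
        For anyone who wants to reproduce that table, here is a small stand-in for 858276-2 (my own, not BrowserUk's original) that enumerates every non-empty subset of 1..5 via a bitmask and records its average:

        use strict;
        use warnings;
        use List::Util qw( sum );

        my @set = 1 .. 5;
        my %average;

        # Each bitmask from 1 .. 2**N-1 selects one non-empty subset.
        for my $mask ( 1 .. 2**@set - 1 ) {
            my @subset = map { $set[$_] } grep { $mask & ( 1 << $_ ) } 0 .. $#set;
            $average{"@subset"} = sum( @subset ) / @subset;
        }

        printf "%-12s => %s\n", $_, $average{$_} for sort keys %average;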

        This lack of a pattern makes it impossible to come up with heuristics that are guaranteed to converge toward an optimal solution.

        And given that for a set of 100,000 items there are 9.99×10^30102 (*) possible subsets to consider, without some way to compare two partial solutions and rate one "better" than the other as a basis for further exploration, there is no way to construct even a genetic algorithm (a technique often useful for NP-complete problems) for this problem.
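
        (A quick check of that figure, just arithmetic: a set of 100,000 items has 2**100_000 subsets, and

            perl -e 'printf "%.4f\n", 100_000 * log(2) / log(10)'    # prints 30102.9996

        so 2**100_000 is about 10**30102.9996, i.e. roughly 9.99×10^30102.)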

        Whilst it will be possible to construct heuristics that sometimes give better results than my algorithm for particular datasets, I don't believe it is possible to come up with one that is guaranteed to produce better results for all datasets. Even finding the occasional improvement is likely to take two or three orders of magnitude more resources.

        (*) A number so enormous that few people will be able to comprehend just how big it truly is. It is so big that it defies any attempt to "humanise" it. For example:

        If the 10 billion stars in the Universe each had 10 earth-like planets; and each planet had 10 billion people; and each person had Google's resources, (say) 1 million CPUs each; and each CPU had been testing one subset per nanosecond since the birth of the universe; then it would still require 10^30050 more 'lives of the universe' to complete the problem.

        It doesn't really help much does it :)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
      But that solves only half of the problem, which was also to keep as many of the original files as possible. By eliminating a big one, you may have overshot your target too much, causing you to eliminate a lot of small ones to make up for it.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        That is a theoretical possibility, but I challenge you to produce a 100,000-file dataset and target average for which it actually happens.

        (Note: I only ever discard big ones or little ones; never both!)


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Creating a sub-list constrained by an average while maintaining maximum number of original elements
by Khen1950fx (Canon) on Sep 01, 2010 at 09:11 UTC
    I've never averaged a sub-list, but here is my stab at it. I think that it meets the requirements of CountZero's step 1; I'm still stuck on step 2. :)
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;
    use File::stat;
    use List::Util qw(sum min max);

    my %files;
    my $root_dir      = '/root/Desktop/Recent';
    my $min_file_size = 25000;

    find_files();
    create_and_print_output();

    sub find_files {
        print "\nTraversing tree $root_dir...\n\n";
        find( \&trav_tree, $root_dir );
    }

    sub trav_tree {
        return unless -f $File::Find::name;    # plain files only
        my $size = stat($File::Find::name)->size;
        $files{'files'}{'total'} += 1;
        push @{ $files{'files'}{'sizes'} }, $size;
        if ( $size >= $min_file_size ) {
            $files{'sizes'}{$size}{'filename'} = $File::Find::name;
        }
    }

    sub create_and_print_output {
        my $sum               = sum( @{ $files{'files'}{'sizes'} } );
        my $number_of_files   = $files{'files'}{'total'};
        my $average_file_size = $sum / $number_of_files;

        # Smallest and largest of the files at or above $min_file_size.
        my $smallest_file_size = min( keys %{ $files{'sizes'} } );
        my $smallest_file_name = $files{'sizes'}{$smallest_file_size}{'filename'};
        my $biggest_file_size  = max( keys %{ $files{'sizes'} } );
        my $biggest_file_name  = $files{'sizes'}{$biggest_file_size}{'filename'};

        printf "Number of files: %d\n",           $number_of_files;
        printf "Average file size: %d bytes\n\n", $average_file_size;
        printf "Smallest file at or above %d bytes was %s (%d bytes)\n",
            $min_file_size, $smallest_file_name, $smallest_file_size;
        print "\n";
    }
Re: Creating a sub-list constrained by an average while maintaining maximum number of original elements
by Limbic~Region (Chancellor) on Sep 01, 2010 at 13:52 UTC
    Jack B. Nymbol,
    Here is a greedy approach that I think would produce good results (though I haven't even attempted to test it) and should be relatively fast; a rough sketch follows the list.
    1. Divide the list into two ordered sublists (above and below the target average)
    2. Take the largest item off the above list
    3. Take as many items off the below list (smallest first) as needed until you are back within range of the target average
    4. Take the smallest item off the below list
    5. Take as many items off the above list (smallest first) as needed until you are back within range of the target average
    6. If either steps 2/3 or 4/5 fail to balance, drop the item from step 2/4 and repeat the process
    7. Repeat until one of the two lists is empty
    8. Of the list that remains, see if any of the remaining items closest to the target can be added to the pile while still maintaining the target average
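
    A rough, untested sketch of steps 1 through 5 and 7 (the recovery logic of steps 6 and 8 is omitted); %size, $target and $margin are hypothetical stand-ins:

    use strict;
    use warnings;

    my %size   = map { sprintf( 'file%05d', $_ ) => int rand 50_000 } 1 .. 1000;
    my $target = 25_000;
    my $margin = 500;

    # Step 1: two sublists ordered by size, above and below the target average.
    my @above = sort { $size{$a} <=> $size{$b} } grep { $size{$_} >  $target } keys %size;
    my @below = sort { $size{$a} <=> $size{$b} } grep { $size{$_} <= $target } keys %size;

    my @keep;
    my $sum = 0;

    while ( @above && @below ) {
        # Steps 2/3: largest "above" file, then smallest "below" files until the
        # running average drops back under $target + $margin.
        push @keep, pop @above;
        $sum += $size{ $keep[-1] };
        while ( @below && $sum / @keep > $target + $margin ) {
            push @keep, shift @below;
            $sum += $size{ $keep[-1] };
        }
        last unless @above && @below;

        # Steps 4/5: smallest "below" file, then smallest "above" files until the
        # running average climbs back over $target - $margin.
        push @keep, shift @below;
        $sum += $size{ $keep[-1] };
        while ( @above && $sum / @keep < $target - $margin ) {
            push @keep, shift @above;
            $sum += $size{ $keep[-1] };
        }
    }

    printf "Kept %d of %d files, average %.1f (target %d)\n",
        scalar @keep, scalar keys %size, $sum / @keep, $target if @keep;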

    Cheers - L~R

Re: Creating a sub-list constrained by an average while maintaining maximum number of original elements
by sundialsvc4 (Abbot) on Sep 01, 2010 at 16:33 UTC

    It is also advantageous, in some cases (say, when dealing with “hundreds of thousands of files”), to do one of two things:

    • Use a file-based sort to sort the list. You can do many things that rely upon this property of such a list, which can in fact be arbitrarily large. These algorithms have a much smaller memory footprint and may scale better.
    • Consider randomizing the file, e.g. by writing a random-number field and then sorting the file by that field. Now a random selection can be made by reading the file sequentially. (A rough sketch follows.)
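
    A minimal sketch of the second bullet, assuming "size filename" lines in file.lst and an external sort(1)/cut(1) on the PATH (the intermediate file names are made up):

    use strict;
    use warnings;

    open my $in,  '<', 'file.lst'   or die "file.lst: $!";
    open my $tmp, '>', 'file.keyed' or die "file.keyed: $!";
    while (<$in>) {
        printf {$tmp} "%.12f\t%s", rand(), $_;    # decorate each line with a random key
    }
    close $tmp or die $!;

    # Let the external, file-based sort do the heavy lifting, then strip the key.
    system('sort -n file.keyed | cut -f2- > file.rand.lst') == 0
        or die "sort failed: $?";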

    Sorting is an “unexpectedly fast and efficient” approach to many things. (“Sorting and Searching” is of course the title of one of Dr. Knuth’s definitive textbooks.)

    You need to ponder the trade-off between “goodness of fit” and “diversity.” Which of these two considerations is the more important? It might well be argued that the algorithm should consist of a single pass through the randomized file, always adding the new record to the pool and throwing out a randomly selected element (say, based on the simple criterion of it being either “larger than” or “smaller than” the target value) with each iteration.

    It might be prudent to model several different candidate algorithms, and then choose among them based upon both your understanding of the problem and their observed statistical characteristics.

Re: Creating a sub-list constrained by an average while maintaining maximum number of original elements
by Your Mother (Archbishop) on Sep 01, 2010 at 22:06 UTC

    This may be completely off-base (and might not work precisely on odd-numbered lists), but is an in-place transform within the scope of the solution? It seems like it might be quite fast, especially if you only need to transform some lines and do that with random skips through the original.

    use warnings;
    use strict;
    use IO::File;

    my ( $sum_orig, $sum_new, $lines, $lines_new );

    open my $out, ">", "reorganized.txt" or die $!;

    while (<DATA>) {
        my ( $size, $file ) = split;
        $sum_orig += $size;

        if ( defined( $_ = <DATA> ) ) {
            my ( $size2, $file2 ) = split;
            $sum_orig += $size2;

            my $max      = $size > $size2 ? $size2 : $size;
            my $rejigger = int rand( $max / 2 );
            my $tmp2     = $size - $rejigger;
            my $tmp1     = $size2 + $rejigger;

            $out->printf( "%-7d %s\n", $tmp1, $file );
            $out->printf( "%-7d %s\n", $tmp2, $file2 );

            $sum_new   += $tmp1 + $tmp2;
            $lines_new += 2;
            $lines++;
        }
        $lines++;
    }

    print "Original average: ", $sum_orig / $lines, $/;
    print "     New average: ", $sum_new / $lines_new, $/;

    __DATA__
    4329    file1
    12311   file2
    657     file3
    4271    file4
    20      file5
    10001   file6