in reply to Creating a sub-list constrained by an average while maintaining maximum number of original elements

Discarding the required number of the largest or smallest values is the only way to achieve your "efficiency" definition. By updating the average incrementally rather than recalculating it from scratch after each discard, this takes a second or two for random lists of 100,000:

#! perl -slw
use strict;
use List::Util qw[ sum ];
use Math::Random::MT qw[ rand srand ];

our $MAX    //= 2**24;
our $N      //= 1e5;
our $S      //= 1;
our $TARGET //= 5e6;

srand( $S );

## Generate $N file names with random sizes up to $MAX.
my $nameRoot = 'fileAAAAA';
my %files = map { $nameRoot++ => int( rand $MAX ) } 1 .. $N;

my @namesByValue  = sort { $files{ $a } <=> $files{ $b } } keys %files;
my @orderedValues = @files{ @namesByValue };

my $ave = sum( @orderedValues ) / @orderedValues;
print "Original average: ", $ave;

my @namesToDiscard;
if( $ave > $TARGET ) {      ## Lose some big ones.
    while( $ave > ( $TARGET + 500 ) ) {
        push @namesToDiscard, pop @namesByValue;
        my $change = ( pop( @orderedValues ) - $ave ) / @orderedValues;
        $ave -= $change;
    }
}
else {                      ## Lose some small ones.
    while( $ave < ( $TARGET + 500 ) ) {
        push @namesToDiscard, shift @namesByValue;
        my $change = ( $ave - shift( @orderedValues ) ) / @orderedValues;
        $ave += $change;
    }
}

printf "TARGET: $TARGET; Achieved ave:$ave (using %d files)\n",
    scalar @orderedValues;

delete @files{ @namesToDiscard };

__END__
c:\test>858276 -TARGET=5e6
Original average: 8374785.96135
TARGET: 5e6; Achieved ave:5000496.12061175 (using 59762 files)

c:\test>858276 -TARGET=10e6
Original average: 8374785.96135
TARGET: 10e6; Achieved ave:10000541.8121829 (using 80621 files)

c:\test>858276 -TARGET=7.5e6
Original average: 8374785.96135
TARGET: 7.5e6; Achieved ave:7500462.85185227 (using 89593 files)
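The speed comes from the incremental update inside those loops: removing a value x from n values whose average is a leaves an average of a - (x - a)/(n - 1), so the remaining values never need to be re-summed. A tiny sketch of just that identity, with made-up values purely for illustration:

use strict;
use warnings;
use List::Util qw[ sum ];

my @values = ( 3, 8, 15, 24, 50 );
my $ave    = sum( @values ) / @values;          # 20

## Drop the largest value and update the average incrementally.
my $dropped = pop @values;
$ave -= ( $dropped - $ave ) / @values;          # a' = a - (x - a)/(n - 1)

printf "incremental: %g   recomputed: %g\n",
    $ave, sum( @values ) / @values;             # both are 12.5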

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP an inspiration; A true Folk's Guy

Re^2: Creating a sub-list constrained by an average while maintaining maximum number of original elements
by Jack B. Nymbol (Acolyte) on Sep 01, 2010 at 19:35 UTC
    BrowserUk, changing the Tavg and Tmargin of course affects the efficiency, but only marginally.
    Original average: 29521.3619337317 for 92050 files
    Target avg.: 15000; Achieved ave: 15979.89 (using 92017 files)
    Efficiency: 99.96% (92050 originals + 92017 finals)
    Thanks, perlmonks is a great place.

      Glad to have helped.

      You may already realise this, but it's worth stating in case you don't: the algorithm I used is in no way guaranteed to find the optimum solution, only a very good one, very quickly.

      The problem with finding a better solution is that there is no discernible pattern in either the relationships or the ordering of the averages generated from even the ordered subsets of the set.

      That last statement is best illustrated by calculating all the averages of all the possible subsets of a small set:

      c:\test>858276-2
      {
        1             => 1,
        "1 2"         => "1.5",
        "1 2 3"       => 2,
        "1 2 3 4"     => "2.5",
        "1 2 3 4 5"   => 3,
        "1 2 3 5"     => "2.75",
        "1 2 4"       => "2.33333333333333",
        "1 2 4 5"     => 3,
        "1 2 5"       => "2.66666666666667",
        "1 3"         => 2,
        "1 3 4"       => "2.66666666666667",
        "1 3 4 5"     => "3.25",
        "1 3 5"       => 3,
        "1 4"         => "2.5",
        "1 4 5"       => "3.33333333333333",
        "1 5"         => 3,
        2             => 2,
        "2 3"         => "2.5",
        "2 3 4"       => 3,
        "2 3 4 5"     => "3.5",
        "2 3 5"       => "3.33333333333333",
        "2 4"         => 3,
        "2 4 5"       => "3.66666666666667",
        "2 5"         => "3.5",
        3             => 3,
        "3 4"         => "3.5",
        "3 4 5"       => 4,
        "3 5"         => 4,
        4             => 4,
        "4 5"         => "4.5",
        5             => 5,
      }
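
      The 858276-2 script itself isn't shown in this node, but a minimal sketch along the same lines, enumerating the non-empty subsets of 1..5 via a bitmask over the indices, might look like this:

      use strict;
      use warnings;
      use List::Util qw[ sum ];

      my @set = 1 .. 5;
      my %aveBySubset;

      ## Every non-empty subset corresponds to a non-zero bitmask over the indices.
      for my $mask ( 1 .. 2**@set - 1 ) {
          my @subset = map { $set[ $_ ] } grep { $mask & ( 1 << $_ ) } 0 .. $#set;
          $aveBySubset{ "@subset" } = sum( @subset ) / @subset;
      }

      printf qq{  "%s" => %s,\n}, $_, $aveBySubset{ $_ } for sort keys %aveBySubset;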

      This lack of a pattern makes it impossible to come up with heuristics that are guaranteed to converge toward an optimal solution.

      And given that for a set of 100,000 items there are 9.99×10^30102 (*) possible subsets to consider, without some way to compare two partial solutions and rate one "better" than the other as a basis for further exploration, there is no way to construct even a Genetic Algorithm (a technique often useful for NP-complete problems) for this problem.
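
      That count is easy to sanity-check (a throwaway calculation, not part of the script above):

      use strict;
      use warnings;

      ## log10 of 2**100_000, the number of subsets of a 100,000-item set.
      my $log10 = 100_000 * log( 2 ) / log( 10 );               # ~30102.9996
      printf "2**100000 ~ %.2fE%d\n", 10 ** ( $log10 - int $log10 ), int $log10;
      ## prints: 2**100000 ~ 9.99E30102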

      Whilst it will be possible to construct heuristics that will sometimes give better results than my algorithm for particular datasets, I don't believe it is possible to come up with one that is guaranteed to produce better results for all datasets. And even finding the occasional improvement is likely to take two or three orders of magnitude more resources.

      (*) A number so enormous that few people will be able to comprehend just how big it truly is. It is so big that it defies any attempt to "humanise" it. For example:

      If the 10 billion stars in the Universe each had 10 earth-like planets; and each planet had 10 billion people; and each person had Google's resources, say 1 million cpus each; and each cpu had been testing one subset per nanosecond since the birth of the universe; then it would still require 10^30050 more 'lives of the universe' to complete the problem.

      It doesn't really help much, does it? :)
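
      For anyone who wants to check the arithmetic behind that figure, here is a rough back-of-the-envelope version; the quantities are the ones stated above, plus an assumed age of the universe of roughly 4.3e17 seconds:

      use strict;
      use warnings;

      my $log10_subsets = 100_000 * log( 2 ) / log( 10 );   # ~30103

      ## Subsets tested in one lifetime of the universe, per the figures above.
      my $tests = 10e9        # stars
                * 10          # earth-like planets per star
                * 10e9        # people per planet
                * 1e6         # cpus per person
                * 1e9         # subsets per cpu per second
                * 4.3e17;     # seconds since the birth of the universe (assumed)

      my $lives = $log10_subsets - log( $tests ) / log( 10 );
      printf "~10**%d more lifetimes of the universe required\n", $lives;

      (Depending on exactly what you assume for the universe's age, that lands at around 10**30049 to 10**30050, in line with the figure above.)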


        No, I was aware of the problem. Your attempt at bringing the size down to Earth was not bad. A nice one I read once was in a book A. C. Clarke wrote about his experience working on 2001 with Kubrick, "Lost Worlds of 2001". Bringing large numbers and sizes down to human scale is not easy -- like conveying how many IPv6 addresses there are. My initial brute-force attempts caused the laptop to overheat and go into sleep mode. (^: There's some middle-ground tuning, as others have pointed out, but it can wait for now.
Re^2: Creating a sub-list constrained by an average while maintaining maximum number of original elements
by CountZero (Bishop) on Sep 01, 2010 at 11:49 UTC
    But that solves only half of the problem, which was also to keep as much of the original list as possible. By eliminating a big one, you may overshoot your target by too much, forcing you to eliminate a lot of small ones to make up for it.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      That is a theoretical possibility, but I challenge you to produce a 100,000-file dataset and target average for which it actually happens.

      (Note: I only ever discard big ones or little ones; never both!)

