in reply to Re^2: Creating a sub-list constrained by an average while maintaining maximum number of original elements
in thread Creating a sub-list constrained by an average while maintaining maximum number of original elements
Glad to have helped.
You may already realise this, but it's worth stating in case you don't: the algorithm I used is in no way guaranteed to find the optimal solution, only a very good one very quickly.
The problem with finding a better solution is that there is no discernible pattern in either the relationship or the ordering of the averages generated from even the ordered subsets of the set.
That last statement is best illustrated by calculating all the averages of all the possible subsets of a small set:
    c:\test>858276-2
    {
      1 => 1, "1 2" => "1.5", "1 2 3" => 2, "1 2 3 4" => "2.5", "1 2 3 4 5" => 3,
      "1 2 3 5" => "2.75", "1 2 4" => "2.33333333333333", "1 2 4 5" => 3,
      "1 2 5" => "2.66666666666667", "1 3" => 2, "1 3 4" => "2.66666666666667",
      "1 3 4 5" => "3.25", "1 3 5" => 3, "1 4" => "2.5", "1 4 5" => "3.33333333333333",
      "1 5" => 3, 2 => 2, "2 3" => "2.5", "2 3 4" => 3, "2 3 4 5" => "3.5",
      "2 3 5" => "3.33333333333333", "2 4" => 3, "2 4 5" => "3.66666666666667",
      "2 5" => "3.5", 3 => 3, "3 4" => "3.5", "3 4 5" => 4, "3 5" => 4,
      4 => 4, "4 5" => "4.5", 5 => 5,
    }
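For what it's worth, here is a minimal sketch of how such a dump can be produced (this is not the original 858276-2 script, just an illustration): every bit pattern from 1 to 2**N-1 selects one non-empty subset, and its average is recorded keyed by the subset itself.

    #!/usr/bin/perl
    # Illustrative sketch only (not the original 858276-2 script):
    # enumerate every non-empty subset of a small set and record the
    # average of its elements, keyed by the subset.
    use strict;
    use warnings;
    use List::Util qw( sum );

    my @set = 1 .. 5;
    my %avg;

    # Each integer from 1 to 2**N - 1 is a bitmask selecting one subset.
    for my $mask ( 1 .. 2**@set - 1 ) {
        my @subset = map { $set[ $_ ] } grep { $mask & ( 1 << $_ ) } 0 .. $#set;
        $avg{ "@subset" } = sum( @subset ) / @subset;
    }

    printf "%-10s => %s\n", $_, $avg{ $_ } for sort keys %avg;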
This lack of a pattern makes it impossible to come up with heuristics that are guaranteed to converge toward an optimal solution.
And given that for a set of 100,000 items there are 9.99×10^30102 (*) possible subsets to consider, without some way to compare two partial solutions and rate one "better" than the other as a basis for further exploration, there is no way to construct even a Genetic Algorithm (which is often useful for NP-complete problems) for this problem.
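(In case anyone wants to check that count without a bignum library, it can be done in log10 space. The snippet below is just a sanity check of the arithmetic; it ignores the one empty subset, which is neither here nor there at this scale.)

    #!/usr/bin/perl
    # Sanity check of the subset count in log10 space, so the full number
    # never has to be held: 2**100_000 = m x 10**e, where
    # e = floor( 100_000 * log10( 2 ) ) and m = 10**( fractional part ).
    use strict;
    use warnings;
    use POSIX qw( floor );

    my $exp      = 100_000 * log( 2 ) / log( 10 );    # ~30102.9996
    my $mantissa = 10**( $exp - floor( $exp ) );

    printf "2**100000 ~= %.2fe%d\n", $mantissa, floor( $exp );
    # prints: 2**100000 ~= 9.99e30102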
Whilst it will be possible to construct heuristics that will sometimes give better results than my algorithm for particular datasets, I don't believe it is possible to come up with one that is guaranteed to produce a better result for all datasets. And even finding the occasional improvement is likely to take two or three orders of magnitude greater resources.
(*) A number so enormous that few people will be able to comprehend just how big it truly is. It is so big that it defies any attempt to "humanise" it. For example:
If the 10 billion stars in the Universe each had 10 earth-like planets; and each planet had 10 billion people; and each person had Google's resources, (say) 1 million CPUs each; and each CPU had been testing one subset per nanosecond since the birth of the universe; then it would still require 10^30050 more 'lives of the universe' to complete the problem.
It doesn't really help much, does it? :)
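For the curious, the footnote's arithmetic is easy to check in the same log10 fashion. The constants below are the assumptions from the paragraph above (the seconds-since-the-birth-of-the-universe figure is my own rounding), and the result lands within a power of ten of the figure quoted.

    #!/usr/bin/perl
    # Rough check of the footnote's arithmetic, kept in log10 space.
    use strict;
    use warnings;

    my $log_subsets  = 100_000 * log( 2 ) / log( 10 );   # ~30103 : subsets to test
    my $log_per_life = 10                                 # stars
                     + 1                                  # earth-like planets per star
                     + 10                                 # people per planet
                     + 6                                  # CPUs per person
                     + 9                                  # subsets per second per CPU
                     + log( 13.7e9 * 365.25 * 86_400 ) / log( 10 );  # seconds since the big bang

    printf "lives of the universe needed ~= 10**%.0f\n",
        $log_subsets - $log_per_life;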