in reply to Re^2: Creating a sub-list constrained by an average while maintaining maximum number of original elements
in thread Creating a sub-list constrained by an average while maintaining maximum number of original elements
Glad to have helped.
You may already realise this, but it's worth stating in case you don't: the algorithm I used is in no way guaranteed to find the optimal solution, only a very good one very quickly.
The problem with finding a better solution is that there is no discernible pattern in either the relationship or the ordering of the averages generated from even the ordered subsets of the set.
That last statement is best illustrated by calculating all the averages of all the possible subsets of a small set:
    c:\test>858276-2
    {
      1 => 1, "1 2" => "1.5", "1 2 3" => 2, "1 2 3 4" => "2.5", "1 2 3 4 5" => 3,
      "1 2 3 5" => "2.75", "1 2 4" => "2.33333333333333", "1 2 4 5" => 3,
      "1 2 5" => "2.66666666666667", "1 3" => 2, "1 3 4" => "2.66666666666667",
      "1 3 4 5" => "3.25", "1 3 5" => 3, "1 4" => "2.5", "1 4 5" => "3.33333333333333",
      "1 5" => 3, 2 => 2, "2 3" => "2.5", "2 3 4" => 3, "2 3 4 5" => "3.5",
      "2 3 5" => "3.33333333333333", "2 4" => 3, "2 4 5" => "3.66666666666667",
      "2 5" => "3.5", 3 => 3, "3 4" => "3.5", "3 4 5" => 4, "3 5" => 4,
      4 => 4, "4 5" => "4.5", 5 => 5,
    }
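For what it's worth, here is a minimal sketch of how such a dump can be produced (this is not the original 858276-2 script, just an illustration): every bit pattern from 1 to 2**N-1 selects one non-empty subset, and its average is recorded keyed by the subset itself.

    #!/usr/bin/perl
    # Illustrative sketch only (not the original 858276-2 script):
    # enumerate every non-empty subset of a small set and record the
    # average of its elements, keyed by the subset.
    use strict;
    use warnings;
    use List::Util qw( sum );

    my @set = 1 .. 5;
    my %avg;

    # Each integer from 1 to 2**N - 1 is a bitmask selecting one subset.
    for my $mask ( 1 .. 2**@set - 1 ) {
        my @subset = map { $set[ $_ ] } grep { $mask & ( 1 << $_ ) } 0 .. $#set;
        $avg{ "@subset" } = sum( @subset ) / @subset;
    }

    printf "%-10s => %s\n", $_, $avg{ $_ } for sort keys %avg;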
This lack of a pattern makes it impossible to come up with heuristics that are guaranteed to converge toward an optimal solution.
And given that for a set of 100,000 items there are 9.99×10^30102 (*) possible subsets to consider, without some way to compare two partial solutions and rate one "better" than the other as a basis for further exploration, there is no way to construct even a Genetic Algorithm (which is often useful for NP-complete problems) for this problem.
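(In case anyone wants to check that count without a bignum library, it can be done in log10 space. The snippet below is just a sanity check of the arithmetic; it ignores the one empty subset, which is neither here nor there at this scale.)

    #!/usr/bin/perl
    # Sanity check of the subset count in log10 space, so the full number
    # never has to be held: 2**100_000 = m x 10**e, where
    # e = floor( 100_000 * log10( 2 ) ) and m = 10**( fractional part ).
    use strict;
    use warnings;
    use POSIX qw( floor );

    my $exp      = 100_000 * log( 2 ) / log( 10 );    # ~30102.9996
    my $mantissa = 10**( $exp - floor( $exp ) );

    printf "2**100000 ~= %.2fe%d\n", $mantissa, floor( $exp );
    # prints: 2**100000 ~= 9.99e30102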
Whilst it will be possible to construct heuristics that will sometimes give better results than my algorithm for particular datasets, I don't believe it is possible to come up with one that is guaranteed to produce a better result for all datasets. And even finding the occasional improvement is likely to take two or three orders of magnitude greater resources.
(*) A number so enormous that few people will be able to comprehend just how big it truly is. It is so big that it defies any attempt to "humanise" it. For example:
If the 10 billion stars in the Universe each had 10 earth-like planets; and each planet had 10 billion people; and each person had Google's resources, (say) 1 million CPUs each; and each CPU had been testing one subset per nanosecond since the birth of the universe; then it would still require 10^30050 more 'lives of the universe' to complete the problem.
It doesn't really help much, does it? :)
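For the curious, the footnote's arithmetic is easy to check in the same log10 fashion. The constants below are the assumptions from the paragraph above (the seconds-since-the-birth-of-the-universe figure is my own rounding), and the result lands within a power of ten of the figure quoted.

    #!/usr/bin/perl
    # Rough check of the footnote's arithmetic, kept in log10 space.
    use strict;
    use warnings;

    my $log_subsets  = 100_000 * log( 2 ) / log( 10 );   # ~30103 : subsets to test
    my $log_per_life = 10                                 # stars
                     + 1                                  # earth-like planets per star
                     + 10                                 # people per planet
                     + 6                                  # CPUs per person
                     + 9                                  # subsets per second per CPU
                     + log( 13.7e9 * 365.25 * 86_400 ) / log( 10 );  # seconds since the big bang

    printf "lives of the universe needed ~= 10**%.0f\n",
        $log_subsets - $log_per_life;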