Re^15: Divide array of integers into most similar value halves (Updated)

Update: I know know why the problem didn't show up in my version of the script. As I mentioned, I'm using a custom shuffle routine, which does an in-place shuffle. The List::Util version of shuffle is not in-place, hence the last line of the for loop should be @in = shuffle @in; for use with List::Util, not shuffle @in; as for my version. I've update the code below.

I apologise. It is a legit bug. I'm amazed it hasn't shown up before and confused as to why it doesn't show up in my lightly modified version of the same script you posted, but...this fixes it. Till the next one....

sub partition {
    my( $limit, $aRef ) = @_;
    my @in = sort{ $a <=> $b } @$aRef;
    my $target = sum( @in ) >> 1;
    my( $best, @best ) = 9e99;
    my $soFar = 0;
    my @half;
    for( 1 .. $limit ) {
#        printf "%6d : [@half] [@in] [@best]\n", abs( $soFar - $target
+ );

        $soFar += $in[ 0 ], push @half, shift @in
            while @in > 1 and $soFar < $target;
        return( \@half, \@in ) if $soFar == $target;

        my $diff = abs( $soFar - $target );
        ( $best, @best ) = ( $diff, @half ) if $diff < $best;

        $soFar -= $half[ 0 ], push @in, shift @half
            while @half > 1 and $soFar > $target;
        return( \@half, \@in ) if $soFar == $target;

        $diff = abs( $soFar - $target );
        ( $best, @best ) = ( $diff, @half ) if $diff < $best;

        srand( $_ );
        @in = shuffle @in;
        ## shuffle @in; ## Only works for an inplace shuffle!
    }
    my %seen; $seen{ $_ }++ for @best;
    return \@best, [ grep{ --$seen{ $_ } < 0 } @$aRef ];
}
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Comment on Re^15: Divide array of integers into most similar value halves (Updated) Select or Download Code

Replies are listed 'Best First'.
Re^16: Divide array of integers into most similar value halves by Pepe (Sexton) on Sep 04, 2008 at 22:46 UTC
Thank you so much BrowserUk! It's working so far in a relatively small number of arrays (1000). I already found one script that's giving me no error in tens of thousands on arrays. You can get it from the conversation, posted by 'tilly'. For now I think I'll stick to that one. But after two days of trying everything I found here, I believe I must try yours together with tilly's and see if both of you have reached the same nirvana. Just for the heck of being able to thank everyone that helped, and post a couple of scripts that work undeniably well. My most sincere thanks. Pepe	[reply]
Re^17: Divide array of integers into most similar value halves by BrowserUk (Patriarch) on Sep 05, 2008 at 00:17 UTC
Great! I'm glad you found a solution to your liking. However, you have a decision to make. It's a trade off between consistently optimal results and consistently predictable timings. tilly's exhaustive search suffers from similar problems to my own attempt at such. Whilst it will eventually find an optimum solution, sometimes quite quickly, quite frequently it will spend an inordinate amount of time finding it. And an inordinate amount of memory too! Below, you will see a comparison of tilly's (latest) and my (latest) code run against the same random data sets with 1e1 thru 1e6 elements. Notice how with 1e3 and 1e5 elements, tilly's code was timed out after a full five minutes where the semi random approach never takes more than ~10 seconds. For the 1e3 case, tilly's code uses over 1GB of ram. For the 1e5 case, it came perilously close to exhausting my 1.5GB of physical memory. c:\test>708290-b -T=300 Testing buk with 10 random values in the range 0 .. 10000 Differ +ence:= 26; took 0.002259 seconds Testing tilly with 10 random value sin the range 0 .. 10000 Differ +ence:= 2; took 0.000975 seconds Testing buk with 100 random values in the range 0 .. 10000 Differ +ence:= 1; took 0.039580 seconds Testing tilly with 100 random values in the range 0 .. 10000 Differ +ence:= 1; took 8.175000 seconds Testing buk with 1000 random values in the range 0 .. 10000 Differ +ence:= 1; took 0.070234 seconds Testing tilly with 1000 random values in the range 0 .. 10000 + ***** timed out after 300 seconds Testing buk with 10000 random values in the range 0 .. 10000 Differ +ence:= 0; took 0.203125 seconds Testing tilly with 10000 random values in the range 0 .. 10000 Differ +ence:= 0; took 0.161583 seconds Testing buk with 100000 random values in the range 0 .. 10000 Diffe +rence:= 1; took 3.093750 seconds Testing tilly with 100000 random values in the range 0 .. 10000 + *** timed out after 300 seconds Testing buk with 1000000 random values in the range 0 .. 10000 Diff +erence:= 0; took 10.578125 seconds Testing tilly with 1000000 random values in the range 0 .. 10000 Diff +erence:= 0; took 21.593750 seconds [download] The memory consumption may not be a problem if your data sets are always smallish, but occasionally even with a relatively small dataset of 100 elements it will take longer that 60 seconds to find a solution, and yet only take 18 seconds for 1 million! c:\test>708290-b -MAX=1e1 Testing buk with 10 random valuesin the range 0 .. 1e1 Differenc +e:= 8; took 0.002118 seconds Testing tilly with 10 random valuesin the range 0 .. 1e1 Differenc +e:= 8; took 0.001475 seconds Testing buk with 100 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 0.039413 seconds Testing tilly with 100 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 7.955964 seconds Testing buk with 1000 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 0.088507 seconds Testing tilly with 1000 random valuesin the range 0 .. 1e1 Terminati +ng on signal SIGINT(2) c:\test>708290-b -MAX=1e1 Testing buk with 10 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 0.000477 seconds Testing tilly with 10 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 0.001215 seconds Testing buk with 100 random valuesin the range 0 .. 1e1 Differenc +e:= 1; took 0.046773 seconds Testing tilly with 100 random valuesin the range 0 .. 1e1 c:\test>708290-b -MAX=1e3 Testing buk with 10 random valuesin the range 0 .. 1e3 Differenc +e:= 7; took 0.001884 seconds Testing tilly with 10 random valuesin the range 0 .. 1e3 Differenc +e:= 7; took 0.001118 seconds Testing buk with 100 random valuesin the range 0 .. 1e3 Differenc +e:= 1; took 0.039995 seconds Testing tilly with 100 random valuesin the range 0 .. 1e3 + *** timed out after 60 seconds Testing buk with 1000 random valuesin the range 0 .. 1e3 Differenc +e:= 1; took 0.078125 seconds Testing tilly with 1000 random valuesin the range 0 .. 1e3 + *** timed out after 60 seconds Testing buk with 10000 random valuesin the range 0 .. 1e3 Differenc +e:= 0; took 0.203125 seconds Testing tilly with 10000 random valuesin the range 0 .. 1e3 Differenc +e:= 0; took 0.183086 seconds Testing buk with 100000 random valuesin the range 0 .. 1e3 Differen +ce:= 1; took 2.879401 seconds Testing tilly with 100000 random valuesin the range 0 .. 1e3 + ***** timed out after 60 seconds Testing buk with 1000000 random valuesin the range 0 .. 1e3 Differe +nce:= 0; took 8.453125 seconds Testing tilly with 1000000 random valuesin the range 0 .. 1e3 Differe +nce:= 0; took 17.265625 seconds [download] I appreciate your sentiments elsewhere that tilly's code is reliable--that's why I chose it to verify my own against:)--but being exhaustive and deterministic makes it easier to test and debug. The non-determinism of a random (genetic) algorithm makes it much harder to test. That's why I've spent the last two days attempting to get a reproducible random shuffle. But, for NP-hard problems, Genetic Algorithms have one huge advantage. You can specify how long you are prepared to ever wait (in terms of either iterations or best solution within a time limit) for a solution. That's much harder to arrange with an exhaustive solution. There is another possibility. You can combine the two approaches. You start off using the exhaustive approach and either through a time limit, or (and you'd have to consult tilly for details), through some heuristic guesstimate of how long it is likely to take for the given data set, switch to using the genetic algorithm. Perhaps passing the best found so far from the exhaustive attempt into the genetic to see if it can improve it within some specified number of iterations. In any case, whichever approach meets your needs, thank you for presenting an extremely interesting (if at times, frustrating:) problem. That's only the fourth or fifth GA I've ever attempted, and they are frustratingly hard to verify. I learnt one big thing from your feedback. Using small datasets (3 or 4 elements) is a far better testing strategy that large arrays of random values! Three nested loops cycling through all the possible permutations of 3 x 0 .. 99, would be trivial and fast to verify through brute force, but (as your feedback showed), stands to highlight bugs very quickly. Obvious now I seen it in action. But hey! We're never too old to learn something new. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]