Pepe has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am trying to divide an array of integers into the two subarrays whose difference of the sums is the minimum possible. It would be something like having balls with different weights and trying to split them in two groups that will balance best in a two plate scale. For example,

my @array = (8,14,32,29);
should be divided into
@subarray1 = (8,32) # total value 40
@subarray2 = (14,29) # total value 43

or

my @array = (7,10,12,15,40);
should be divided into
@subarray1 = (7,10,12,15) # add up to 44
@subarray2 = (40) #add up to 40

Any suggestions? I don't seem to find the right method, no matter how hard I try.
Thanks a lot in advance.

Pepe
  • Comment on Divide array of integers into most similar value halves

Replies are listed 'Best First'.
Re: Divide array of integers into most similar value halves
by moritz (Cardinal) on Sep 01, 2008 at 19:35 UTC
    Are your lists big? And do you really need the best solution, or is a good approximation sufficient?

    I think that Dominus' book Higher Order Perl contains an example for something similar, and if I remember correctly it just went through all possible solutions, which doesn't scale very well for long lists.

    Update: Here's a very simple approximative solution (after finding out that it might be good enough):

    #!/usr/bin/perl use strict; use warnings; use List::Util qw(sum); divide(8,14,32,29); divide(7,10,12,15,40); sub divide { my @array = reverse sort { $a <=> $b} @_; my $target = sum(@array) / 2; my @result; my $current = 0; for (@array) { if ($current + $_ <= $target) { push @result, $_; $current += $_; } } print "@result\n"; print "Target: $target; Result: $current\n"; } __END__ 32 Target: 41.5; Result: 40 7 15 12 Target: 42; Result: 34

    If that's not good enough you can iterate over all pairs of values from distinct sets and see if the reached value improves. If it does, swap these two elements.

    (second update: fixed sort. FunkyMonk++)

      Thanks a lot
      I believe it works just fine. Thanks again
      My lists are not really big. Almost never bigger than 100.
        For a problem with complexity O(2**$n) a value of $n == 100 is enough to make your computer work until either he or your dies. That's why I asked if an approximation is enough.
Re: Divide array of integers into most similar value halves
by Skeeve (Parson) on Sep 01, 2008 at 19:48 UTC

    You know that this problem is np-complete? It's well known as the partition problem.


    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
      I read about it, but the Partition Problem just tells you if the list of numbers can be partitioned into 2 halves that have the same sum.
      I'm not really interested into checking that. I don't mind them having the same or different sum. I only want the best possible partition.
      Thanks anyway.

        Strange! The german wikipedia page states "Gesucht wird eine Aufteilung dieser Zahlen auf zwei Haufen, so dass die Differenz der Summen der Zahlen in den beiden Haufen möglichst klein ist." (Find a partition such that the diffference of the sum of each heap is minimal). And this is exactly what you're after.


        s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
        +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
        I read about it, but the Partition Problem just tells you if the list of numbers can be partitioned into 2 halves that have the same sum. I'm not really interested into checking that. I don't mind them having the same or different sum. I only want the best possible partition. Thanks anyway.
        Note that the problem you are describing is (at best) just as hard as the original. Namely, if you can solve your problem, then you can solve the problem of finding a partition into equal halves by finding the best partition, and noting whether the difference in sums is 0.

        UPDATE: Oops, sorry, both moritz and jethro pointed that out already.

        If you had an algorithm that found the best possible partition and there is one with difference 0 (for some array of integers) then your algorithm would find that one, otherwise you would find a difference>0. So your algorithm would be a solution to the partition problem.
        But to find out if your solution is the best partition (and it's value is not sum/2) you have to check if there is a possible better partition, which means checking if the partition with the value sum/2 exists. Thus you have to solve the partition problem. D'oh.
Re: Divide array of integers into most similar value halves
by FunkyMonk (Bishop) on Sep 01, 2008 at 19:46 UTC
    How about...

    • @numbers = reverse sort @numbers
    • $target = sum(@numbers)/2
    • loop forever {
    •   find largest $number in @numbers that is smaller than $target
    •   exit loop if not found
    •   add that number to @bucket
    •   remove that number from @numbers
    •   subtract that number from $target
    • }
    • return @bucket and @numbers

    and in perl 5.10...

    use List::Util qw(sum); use List::MoreUtils qw(first_index); #my @numbers = map { rand() * 100 } 1 .. 5; my @numbers = (8,14,32,29); my @b = split_evenly( \@numbers ); say "First container: sum(@{$b[0]}) = ", sum @{$b[0]}; say "Second container: sum(@{$b[1]}) = ", sum @{$b[1]}; sub split_evenly { my @numbers = reverse sort { $a <=> $b } @{+shift}; my $target = sum(@numbers) / 2; say "Target is $target"; my @b; while ( 1 ) { my $index = first_index { $_ <= $target } @numbers; last if $index < 0; $target -= $numbers[$index]; push @b, splice @numbers, $index, 1; } return \@b, \@numbers; }

    replace the says with print for perl < 5.10.

    Tested, but not exhaustively. Use at your own risk. etc.

    Update: changed the output so the lists are displayed.


    Unless I state otherwise, all my code runs with strict and warnings
      Unfortunately, this algorithm can fail. I didn't turn up an example with small numbers after some thinking, so here's one that I stole from an article transitively linked by Skeeve: the set {62, 83, 121, 281, 486, 734, 771, 854, 885, 1003} has a perfect partition (namely, {62, 83, 121, 486, 885, 1003}), but the greedy algorithm that you suggest returns {1003, 885, 734}, which has a defect of 18 (not 32, I think, despite the article).

      The algorithm I used to find that match is based on my imperfect recall of the one from Dominus's book mentioned by moritz. I'm sure it can be made much more elegant (for example, by not hard-wiring @list; and, less trivially, by not passing in $so_far_ref—I was just using it to print the diagnostics), but I think that this works:

        I was aware that my "solution" wasn't perfect, but the OP suggested that near enough was good enough.

        Everything in life is a comprimise :-)

        Thanks for noting. I'm still trying it out the first algorithm in my data. I believe is close enough for what I want.
        Of course I'm open to any improvements...
        Thanks a lot.
      Hey FunkyMonk,
      sorry to bother you again, but I've been extensively testing the script you posted and works most of the time but fails in cases as:
      @array = (41,37,37,43)
      returning
      @array1=(43)
      @array2=(41,37,37)
      Any ideas?

      Thanks in advance

      Pepe
        Take a look at moritz's orginal reply and its followups. That's the difference between doing it correctly (their discussion) and an evil, dirty, quick hack (my version).

        It all depends with how bad "good enough" can be :)

        My program will never allow the first container to hold more than $target. All the other numbers go into the second. With (41,37,37,43) as input, 43 goes into the first container, and none of the other numbers will fit, so all the rest go into the second container.

        Tweaking the comparison in first_index will allow this dataset to be divided more evenly, but will produce worse solutions in some cases. For example, making this change

        first_index { $_ <= $target*1.1 } # allow container to overflow by 10%

        produces (I changed the output format slightly):

        Original numbers: (41 37 37 43) Target is 79 First container: sum(43 37) = 80 Second container: sum(41 37) = 78

        The code is fast, so there's nothing to stop you dividing your numbers twice using the untweaked and tweaked comparison and choosing the answer you like best.

        But, all that does is make a dirty, evil, quick hack dirtier, more evil and slower. You'll still be able to find cases where it fails.

        The only real solution is to do it properly.


        Unless I state otherwise, all my code runs with strict and warnings
      So far works for me,
      I have to test it in 20.000 arrays, so I will write some code to prove that it works fine, but it seems to do pretty well.

      Thanks a lot It's been great help!!!!
Re: Divide array of integers into most similar value halves
by BrowserUk (Patriarch) on Sep 02, 2008 at 05:41 UTC

    Note: This only works with positive integers!

    Try this. It uses a semi-random approach with a specifiable limit to bound the attempts it makes. It seems to do a pretty good job of finding the optimium solution most of the time. It will occasionally miss, but when it does, it still delivers a close to optimium result:

    #! perl -slw use strict; use List::Util qw[ sum shuffle ]; sub partition { my( $limit, $aRef ) = @_; my @in = sort{ $a <=> $b } @$aRef; my $target = sum( @in ) >> 1; my( $best, @best ) = 9e99; my $soFar = 0; my @half; for( 1 .. $limit ) { #print "$soFar : [@half] [@in] [@best]"; <>; $soFar += $in[ 0 ], push @half, shift @in while $soFar < $targ +et; return( \@half, \@in ) if $soFar == $target; my $diff = abs( $soFar - $target ); ( $best, @best ) = ( $diff, @half ) if $diff < $best; $soFar -= $half[ 0 ], push @in, shift @half while $soFar > $ta +rget; return( \@half, \@in ) if $soFar == $target; $diff = abs( $soFar - $target ); ( $best, @best ) = ( $diff, @half ) if $diff < $best; @in = shuffle @in; } my %seen; $seen{ $_ }++ for @best; ## return \@best, [ grep !$seen{ $_ }--, @$aRef ]; ## Fix duplicate +s bug return \@best, [ grep{ !exists $seen{ $_ } or !$seen{ $_ }-- } @$a +Ref ]; } for ( [ 62, 83, 121, 281, 486, 734, 771, 854, 885, 1003 ], [ 7,10,12,15,40 ], [ 8, 14, 32, 29 ], [ 41, 37, 37, 43 ], [ 99, (1)x99 ], [ 1 .. 99 ], [ map $_*2+1, 1 .. 50 ], [ map int( rand 1000 ), 1 .. 100 ], ) { my( $a1, $a2 ) = partition( 1e2, $_ ); my( $t1, $t2 ) = ( sum( @$a1), sum( @$a2 ) ); print "\n@$a1 := ", $t1; print "@$a2 := ", $t2; print "Diff: ", abs( $t1 - $t2 ); }

    The limit is specified as an input parameter. Above I'm using a hardcoded limit of 100, but that might be better specified in terms of the number of input values. Say, @input * 10 might be a starting point. The longer you're prepared to wait, the better solution it will generally find. It will return quickly if a perfect solution is found.

    Testing is limited to what you see above. Anyone know of any particularly hard cases?

    c:\test>708290.pl 734 771 854 281 := 2640 121 62 83 885 486 1003 := 2640 Diff: 0 7 10 12 15 := 44 40 := 40 Diff: 4 8 32 := 40 14 29 := 43 Diff: 3 37 41 := 78 37 43 := 80 Diff: 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 +1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 + 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 := 99 99 := 99 Diff: 0 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 +30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 +53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 := 2475 94 1 72 99 2 81 96 86 79 74 91 85 92 89 76 71 98 88 93 97 77 84 78 90 +73 75 95 3 4 87 80 83 82 := 2475 Diff: 0 63 65 67 69 71 73 101 79 99 11 75 91 81 87 95 5 37 13 41 77 := 1300 27 29 3 33 21 53 85 61 23 17 47 49 45 51 35 89 7 55 43 57 59 15 25 31 +39 9 93 83 97 19 := 1300 Diff: 0 248 28 269 882 847 389 242 192 484 519 123 259 528 363 410 722 110 34 +501 922 695 622 831 48 464 234 548 815 702 725 403 629 29 579 365 81 +294 47 927 879 639 470 902 318 313 736 228 230 750 38 620 77 702 426 +582 := 25020 923 457 738 7 847 234 696 334 200 989 231 744 528 402 619 485 791 272 +261 533 525 298 724 615 253 71 957 836 904 105 575 929 537 802 791 94 +1 437 538 740 165 229 528 867 609 753 := 25020 Diff: 0

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Sorry to disappoint you BrowserUk,
      your sub does not work if a number is repeated in the array.
      Try @array = (33,33,37);
      The solution is good in every other case, though. Only need that to be fixed.
      Thanks a lot for your effort, anyways.
      Pepe

        Thanks. That is indeed a bug. The (lightly tested) fix is to change the last line of the sub to:

        return \@best, [ grep{ !exists $seen{ $_ } or !$seen{ $_ }-- } @$a +Ref ];

        Basically, because the sub only stores one best partition, when it runs out of iterations, it needs to filter the input array to generate teh other half. The filter was a sloppy variation on a theme that works for other purposes but not this. I think the above corrects it, but I will need to give more thought once my brain has awoken.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Divide array of integers into most similar value halves (good enough)
by tye (Sage) on Sep 02, 2008 at 05:42 UTC

    Yay for fancy comp sci terms. It is NP-complete! (or is it "NP hard"?)

    Given 100 numbers, this code finds a likely-optimal solution in about 1 second. If you are unlucky, it can spend a very long time after that not finding any better solutions. :)

    #!/usr/bin/perl -w use strict; sub halfWeights { my @weights= sort { $b <=> $a } @_; my $dist= 0; $dist += $_ for @weights; $dist /= 2; my $best= $dist; my @sol; my @idx= ( 0 ); while( 1 ) { $dist -= $weights[$idx[-1]]; for( abs($dist) ) { if( $_ < $best ) { $best= $_; @sol= @idx; printf STDERR "%+g: %s\n", $_, join( ", ", @weights[@i +dx] ); return @weights[ @sol ] if( 0 == $_ ); } } if( 0 < $dist ) { push @idx, 1 + $idx[-1] } else { $dist += $weights[ $idx[-1]++ ]; } while( @weights <= $idx[-1] ) { pop @idx; return @weights[ @sol ] if( 1 == @idx ); $dist += $weights[ $idx[-1]++ ]; } } } @ARGV= ( 100 ) if( ! @ARGV ); push @ARGV, $ARGV[0]*$ARGV[0] if( 1 == @ARGV ); if( 2 == @ARGV ) { my $cnt= shift @ARGV; my $max= shift @ARGV; push @ARGV, 1 + int rand($max) while( @ARGV < $cnt ); } elsif( 3 == @ARGV ) { my $cnt= shift @ARGV; while( @ARGV < $cnt ) { push @ARGV, $ARGV[-2] + $ARGV[-1]; } } halfWeights( @ARGV );

    Note that if you have fractional weights, then the way that things are computed will likely cause a growing accumulation of errors that won't impact the likely-optimal solution much but could cause serious misrepresentation of how close other solutions really are if you wait the hundreds of years and more that it could spend contemplating them.

    - tye        

Re: Divide array of integers into most similar value halves
by swampyankee (Parson) on Sep 01, 2008 at 21:08 UTC
Re: Divide array of integers into most similar value halves
by psini (Deacon) on Sep 01, 2008 at 20:22 UTC

    If you want the best solution, 100 values is not a small list for, at first approximation you need about (n/2)! tries to check all the possible partitions.

    You could use a couple of tricks to reduce the range of solutions:

    • Say N the sum of all the elements, divided by two, and floored; say S(X) the sum of the elements of a given subset X of your list. Your problem can be reduced to find the subset P with the minimum ABS(S(P)-N) within all the possible subsets. The second set Q is obviously given by the difference between the original list and P.
    • Say M the greatest element of the list. You can safely assume that it is part of the solution (all elements are) so you can take it off from the list, decrease N by M and apply the previous point to the shortened list and the reduced N. If your list has a great dispersion, this can lead to a significant reduction of the number of tests required.

    Rule One: "Do not act incautiously when confronting a little bald wrinkly smiling man."

Re: Divide array of integers into most similar value halves
by GrandFather (Saint) on Sep 01, 2008 at 22:17 UTC

    What is the bigger picture? Generally a good enough solution is a strong function of what you need the solution for. In this case a generally solution may be impractical, but a good enough solution can only be determined in the context of the problem context.


    Perl reduces RSI - it saves typing
Re: Divide array of integers into most similar value halves
by johndageek (Hermit) on Sep 02, 2008 at 14:38 UTC
    Jsut my simple attempt:
    #!/usr/bin/perl ## arrays to test #@aoi = (1,33,2,5,6,2,9999,1,555,333,654,8,1,234,0,765,2,3,446,753); #@aoi = (1,33,2,5,6,2,999,1,555,333,654,8,1,234,0,765,2,3,446,753); #@aoi = (1,1,33,2,5,6,2,999,1,555,333,654,8,1,234,0,765,2,3,446,753); #@aoi = (1,1,33,2,5,6,2,999,8,1,555,333,654,8,1,234,0,765,2,3,446,753) +; @aoi = (2406,1,1,33,2,5,6,2,999,8,1,555,333,654,8,1,234,0,765,2,3,446, +753); # working variables @arr1 = (); $sum1 = 0; @arr2=(); $sum2 = 0; # sort list @saoi = sort { $a <=> $b} @aoi; ## start with highest value working downwards, pushing onto array cont +aining ## the lowest sum. SHould give you the least available difference betw +een array sums for ($t=$#saoi;$t>-1;$t--){ if ($sum2 > $sum1){ $sum1 = $sum1 + $saoi[$t]; push @arr1,$saoi[$t]; }else{ $sum2 = $sum2 + $saoi[$t]; push @arr2,$saoi[$t]; } } $diff = $sum2 - $sum1; print "$sum2 - $sum1 = $diff\n";

    Enjoy!
    Dageek

      Dageek,

      this is a great idea!!! I'm gonna try it.
      By adding the next value to the smallest group the result should be close to optimum.
      Also simple and fast

      Thanks a lot
      Pepe
        It'll fail for e.g. (3,3,2,2,2) and other sets where the top two numbers are odd, the rest even, and the ideal split is even. (and other cases too, e.g. (10,10,4,4,4,4,4))
Re: Divide array of integers into most similar value halves
by tilly (Archbishop) on Sep 02, 2008 at 05:58 UTC
    Many comments to the contrary notwithstanding, the odds are very good that your actual problem is not really NP-complete. My discussion of why not is kind of long, though, so I posted it as a meditation at NP-complete sometimes isn't.
Re: Divide array of integers into most similar value halves
by praveeperl (Initiate) on Sep 02, 2008 at 10:59 UTC
    my @array = (8,14,32,29); @new = sort{$a <=> $b}@array;#sorting $i = 'first';#set flag foreach (@new){ if ($i eq 'first'){ push (@subarray1, $_); $i = 'second'; }else{ push (@subarray2, $_); $i = 'first' } } output: @subarray1 = (8,32) # total value 40 @subarray2 = (14,29) # total value 43
      This is not a general solution. Instead of solving the problem, it puts every second item into each set.

      That happens to work in this case by chance, but you shouldn't rely on the ordering of the input data.

Re: Divide array of integers into most similar value halves
by bduggan (Pilgrim) on Sep 02, 2008 at 19:01 UTC
    This sounds like a good dynamic programming exercise :
    use List::Util qw/sum/; use Memoize; memoize('sum_to'); use strict; sub sum_to { # given : $n, @ary # return : two array refs, the first one summing to $n # the second one has the remaining elements # return nothing if it's impossible my $n = shift; my @ary = @_; return [[],\@ary] if $n==0; return if ($n<0 || @ary==0); for my $elem (@ary) { my %seen; my @left = grep { $_ != $elem || $seen{$_}++ } @ary; if (my $found = sum_to($n-$elem,@left)) { return [ [ $elem, @{ $found->[0] } ], [ @{ $found->[1] } ] ]; } } return; } my @nums = map int rand 1000, 1..100; my $sum = sum @nums; my $target = int ( $sum / 2); while ($target > 0) { my $found = sum_to($target,@nums) or next; my $first = $target ." == ". join '+', @{ $found->[0] }; my $second = ($sum-$target)." == ". join '+', @{ $found->[1] }; print join "\n",$first, $second; die "there was a problem" unless eval $first && eval $second; last; } continue { $target--; }
    I think this might fail in the case where there are two ways to sum to a given number, and only one of those ways is right -- to account for this, sum_to shd probably return an array of solutions.
Re: Divide array of integers into most similar value halves
by FunkyMonk (Bishop) on Sep 02, 2008 at 21:44 UTC
    Just out of interest, why do you want it? What's the real application?
      I'm not sure how to explain this... I have DNA sequence alignments, something like sequences of letter piled up so the similar bases (letters) are in top of each other. Then, each sequence has assigned a quality scores.
      Something like:

      A(45) C(44) T(44) A(45)
      A(31) T(31) T(35) A(37)
      A(50) C(52) A(52) A(52)

      I'm looking for variation in those sequences, but in order to find variation that is reliable I've calculated that I need to find groups of sequences with the same base that add up to at least quality 50.
      So in the previous example the second column C(44+52=96)/T(31) does not have enough quality to be considered (only one of the bases reaches the required quality), but the third column T(44+35=79)/A(52) does.
      With your script I was trying to estimate how many of those positions in the alignment can I even consider analyzing, ie. how many of those positions (array of quality scores) can be separated in two subarrays that pass the threshold.
      Sorry if it's not clear.

      Pepe
        That was clear enough, thanks.

        I normally switch off as soon as I've read "CATG" 20 times over :-)