in reply to Re: Average Price Algorithm
in thread Average Price Algorithm

Limbic-Region suggested I spend time comparing that algorithm with his (above), and after consideration I think his is a much better algorithm than the one I proposed above.

I'm nearly sure Limbic-Region's algorithm is provably optimal [update: provided it is adjusted to take into account relative factorization - see below]. It ensures that the largest deviations always end up in the largest baskets, where they are least likely to have a significant impact. His algorithm also uses the values with the smallest deviation from the mean first. This ensures that a bucket whose size prevents an above-mean deviation from being matched with a below-mean deviation will always have the smallest possible unmatched deviation, and hence a bucket average closest to the sample mean.

On the other hand, the algorithm as proposed by Limbic-Region does not take full advantage of the histographic information, so it needs to do a binary search for each smallest-deviation item, not to mention a lot of testing for best fit. This, I think, can be avoided by ranking values by their distance from the mean before we begin filling the buckets.
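A minimal sketch of that preprocessing step (my own illustration; the sub name `rank_by_deviation` is made up): split the value => frequency histogram into an above-mean and a below-mean list, each sorted smallest deviation first, so the bucket-filling loop never has to search.

```perl
use strict;
use warnings;

# Split a value => frequency histogram into two lists, each sorted by
# distance from the weighted mean (smallest deviation first).  Entries
# are [deviation, value, frequency] triples; at-mean items are omitted.
sub rank_by_deviation {
    my ($hFreq) = @_;
    my ($sum, $count) = (0, 0);
    while (my ($value, $freq) = each %$hFreq) {
        $sum   += $value * $freq;
        $count += $freq;
    }
    my $mean = $sum / $count;

    my (@above, @below);
    for my $value (keys %$hFreq) {
        my $dev = $value - $mean;
        next if $dev == 0;    # at-mean items need no matching
        push @{ $dev > 0 ? \@above : \@below },
            [ abs($dev), $value, $hFreq->{$value} ];
    }
    @above = sort { $a->[0] <=> $b->[0] } @above;
    @below = sort { $a->[0] <=> $b->[0] } @below;
    return ($mean, \@above, \@below);
}

my ($mean, $above, $below) = rank_by_deviation({ 1 => 2, 2 => 1, 5 => 1 });
# mean is (1*2 + 2 + 5)/4 = 2.25; below-mean list has 2 before 1
```

Once the lists are ranked, advancing an index through each list replaces every binary search with O(1) work.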

The code for a more histographically aware version of Limbic-Region's algorithm is posted below (without correction for relative factorization). The code has been tested on Debian Etch with Perl 5.8. For demo purposes it prints out the results of allocating 6000 items into 3 buckets for three distributions: (a) a spike with all values at the mean, (b) a symmetrical distribution, i.e. no skew, and (c) a seriously skewed distribution. I've also included a demo run for the original poster's distribution.

BTW, the algorithm below is O(N), where N is the number of items to allocate to buckets. If I'm right about Limbic-Region's strategy being provably optimal, then this problem is very far from NP-complete.

use strict;
use warnings;

sub demoAllocation($$$);

demoAllocation("Distribution: all at mean"
    , {a=>1000, b=>2000, c=>3000}
    , { '6.0' => 6000 }
);
demoAllocation("Distribution: unskewed"
    , {a=>1000, b=>2000, c=>3000}
    , { '3.0' => 300, '4.0' => 600, '5.0' => 700, '5.5' => 900
      , '6.0' => 1000
      , '6.5' => 900, '7.0' => 700, '8.0' => 600, '9.0' => 300
      }
);
demoAllocation("Distribution: skewed"
    , {a=>1000, b=>2000, c=>3000}
    , { '3.0' => 4000, '12.0' => 2000 }
);
demoAllocation("Distribution: Original poster"
    , {A=>65, B=>12, C=>24, D=>19, E=>30}
    , { '93.8' => 5, '93.81' => 20, '93.82' => 10
      , '93.83' => 15, '93.84' => 25, '93.85' => 5
      , '93.87' => 20, '94.0' => 5, '94.1' => 35
      , '94.2' => 10
      }
);

#------------------------------------------------------------
sub demoAllocation($$$) {
    my ($sDescription, $hBuckets, $hFrequency) = @_;
    print "$sDescription\n";

    my ($dAvg, $hAllocation) = allocate($hBuckets, $hFrequency);
    foreach my $sId (sort keys %$hAllocation) {
        my $hItems = $hAllocation->{$sId};
        my $dSum = 0;
        my $iCount = 0;
        my ($dBucketAvg, $dDeviation);

        print "$sId:";
        foreach my $dValue (sort keys %$hItems) {
            my $iFreq = $hItems->{$dValue};
            printf "\t%s \@ \$%.2f\n", $iFreq, $dValue;
            $dSum += $dValue*$iFreq;
            $iCount += $iFreq;
        }
        $dBucketAvg = $dSum/$iCount;
        $dDeviation = $dBucketAvg - $dAvg;
        printf "\tbucket avg: \$%.2f, deviation: \$%.3f\n\n"
            , $dBucketAvg, $dDeviation;
    }
    print "\n";
}

#------------------------------------------------------------
sub allocate($$) {
    my ($hBuckets, $hFrequency) = @_;

    #calculate deviations from the mean
    my $dAvg = calcWeightedAvg($hFrequency);
    my ($iFreqAvg, $aAbove, $aBelow)
        = calcDeviations($hFrequency, $dAvg);

    #sort buckets by size: smallest first
    my @aBuckets
        = sort { $hBuckets->{$a} <=> $hBuckets->{$b} } keys %$hBuckets;

    #allocate items to buckets, smallest first
    my %hAllocations;
    my $iFirstAbove = 0;
    my $iFirstBelow = 0;
    foreach my $sId (@aBuckets) {
        my $iSize = $hBuckets->{$sId};
        $hAllocations{$sId} = fillBucket($iSize, $dAvg, \$iFreqAvg
            , $aAbove, \$iFirstAbove
            , $aBelow, \$iFirstBelow);
    }
    return ($dAvg, \%hAllocations);
}

#------------------------------------------------------------
# SUPPORTING FUNCTIONS - alphabetical order
#------------------------------------------------------------
sub calcDeviations($$) {
    my ($hFrequency, $dAvg) = @_;
    my @aAbove;
    my @aBelow;
    my $iFreqAvg = 0;

    #calculate deviations from mean
    while (my ($dValue, $iFreq) = each(%$hFrequency)) {
        if ($dValue == $dAvg) {
            $iFreqAvg += $iFreq;
            next;
        }
        my $dDeviation = $dValue - $dAvg;
        if (0 < $dDeviation) {
            push @aAbove, [ $dDeviation, $dValue, $iFreq ];
        } else {
            push @aBelow, [ -$dDeviation, $dValue, $iFreq ];
        }
    }

    #sort with smallest deviations first
    return ( $iFreqAvg
        , [ sort { compareDeviations($a,$b) } @aAbove ]
        , [ sort { compareDeviations($a,$b) } @aBelow ]
    );
}

#------------------------------------------------------------
sub compareDeviations($$) {
    my ($x, $y) = @_;
    return $x->[0] <=> $y->[0];
}

#------------------------------------------------------------
sub calcWeightedAvg($) {
    my $hFrequency = shift @_;
    my $dSum = 0;
    my $iCount = 0;
    while (my ($dValue, $iFreq) = each(%$hFrequency)) {
        $dSum += $dValue*$iFreq;
        $iCount += $iFreq;
    }
    return $dSum/$iCount;
}

#------------------------------------------------------------
sub fillBucket($$$$$$$) {
    my ($iNeeded, $dAvg, $rFreqAvg
        , $aAbove, $rFirstAbove
        , $aBelow, $rFirstBelow) = @_;

    #take items that are at the mean, if we can
    if ($iNeeded <= $$rFreqAvg) {
        $$rFreqAvg -= $iNeeded;
        return { $dAvg => $iNeeded };
    }

    my $hItems = {};
    my $aUp = $aAbove->[$$rFirstAbove];
    my $aDown = $aBelow->[$$rFirstBelow];
    my $dNetDeviation = 0;

    if (0 < $$rFreqAvg) {
        $iNeeded -= $$rFreqAvg;
        $hItems->{$dAvg} = $$rFreqAvg;
        $$rFreqAvg = 0;
    }

    #take whatever creates the smallest net deviation
    # [0] deviation
    # [1] value
    # [2] frequency
    while ($iNeeded > 0) {
        my $bUseUp = 0;
        if ($aUp) {
            if ($aDown) {
                my $dNetUp = $dNetDeviation + $aUp->[0];
                my $dNetDown = $dNetDeviation - $aDown->[0];
                if (abs($dNetUp) < abs($dNetDown)) {
                    $bUseUp = 1;
                    $dNetDeviation = $dNetUp;
                } else {
                    $bUseUp = 0;
                    $dNetDeviation = $dNetDown;
                }
            } else {
                $bUseUp = 1;
            }
        } else {
            $bUseUp = 0;
        }

        if ($bUseUp) {
            $hItems->{$aUp->[1]}++;
            $aUp->[2]--;
            $$rFirstAbove++ unless $aUp->[2];
            $aUp = $aAbove->[$$rFirstAbove];
        } else {
            $hItems->{$aDown->[1]}++;
            $aDown->[2]--;
            $$rFirstBelow++ unless $aDown->[2];
            $aDown = $aBelow->[$$rFirstBelow];
        }
        $iNeeded--;
    }
    return $hItems;
}

Re^3: Average Price Algorithm
by GrandFather (Saint) on Feb 01, 2009 at 01:21 UTC

    The test sample:

    demoAllocation("Distribution: nasty"
        , {a => 30, b => 30}
        , { '1.0' => 40, '2.0' => 10, '4.0' => 10 }
    );

    Prints:

    Distribution: nasty
    a:  17 @ $1.00
        10 @ $2.00
        3 @ $4.00
        bucket avg: $1.63, deviation: $-0.033

    b:  23 @ $1.00
        7 @ $4.00
        bucket avg: $1.70, deviation: $0.033

    However, inspection of the original data suggests that simply allocating half of each original parcel to the two groups gives a perfect result.
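    That "half of each parcel" observation generalizes: whenever every parcel divides evenly in proportion to the bucket sizes, giving each bucket its proportional share of every parcel reproduces the sample mean exactly. A sketch of that check (my own code, not from the thread; `proportional_split` is a made-up name):

```perl
use strict;
use warnings;

# Give each bucket its proportional share of every value's frequency.
# Returns undef if any share is not a whole number, in which case this
# shortcut does not apply and a real allocation algorithm is needed.
sub proportional_split {
    my ($hBuckets, $hFreq) = @_;
    my $total = 0;
    $total += $_ for values %$hFreq;

    my %alloc;
    for my $id (keys %$hBuckets) {
        my $fraction = $hBuckets->{$id} / $total;
        for my $value (keys %$hFreq) {
            my $share = $hFreq->{$value} * $fraction;
            return undef if $share != int $share;   # not evenly divisible
            $alloc{$id}{$value} = $share;
        }
    }
    return \%alloc;
}
```

    For GrandFather's data, `proportional_split({a => 30, b => 30}, {'1.0' => 40, '2.0' => 10, '4.0' => 10})` gives each bucket 20 @ $1.00, 5 @ $2.00 and 5 @ $4.00, i.e. zero deviation in both buckets.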


    Perl's payment curve coincides with its learning curve.
      Many thanks for the example!

      GrandFather, I'm wondering if you would share your thoughts on

      • why this particular example fails to find the optimal solution?
      • whether there is a solution that would consistently address the reason for failure?

      It strikes me that it has something to do with relative factorization. For example, had the example been scaled down by 10, so that the number of buckets was relatively prime vis-à-vis the number of items assigned to each value, the strategy of minimizing net deviation would have succeeded in finding the optimal result. Conversely, had the code I posted taken relative factorization into account, it would have found the optimal solution.

      Are there other examples you can think of, for which taking into account relative factorization would not help?

      Best, beth

Re^3: Average Price Algorithm
by Limbic~Region (Chancellor) on Feb 02, 2009 at 19:11 UTC
    ELISHEVA,
    Thank you for taking the time to code this up. I agree, and pointed out myself, that there were optimizations to be made to my approach, as I did it hastily. I am not at all convinced that it produces optimal results for all data sets. I think it is a good heuristic approach given the requirements (a runtime of less than 3 seconds with a small margin of error).

    In order to show that it doesn't always produce optimal results, I don't need the rigor of a formal proof; I just need to come up with a counterexample. If you are interested in such an example, it shouldn't be too difficult to find. A small enough data set would allow generating all possible partitions and combinations, which could then be compared against the results of the heuristic algorithm. If you want, I can code that up?
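    Such a brute force needn't enumerate individual items: with a small histogram it is enough to enumerate, per value, how many units go into the first bucket. A rough sketch for the two-equal-bucket case (my own code, not from the thread; the sub name is made up, and the search is exponential in the number of distinct values, so it only suits tiny data sets):

```perl
use strict;
use warnings;

# Exhaustively try every way of assigning per-value counts to bucket A
# (bucket B gets the remainder) and keep the split whose worst bucket
# average deviates least from the sample mean.
sub best_two_bucket_split {
    my ($sizeA, $hFreq) = @_;
    my @values = sort { $a <=> $b } keys %$hFreq;
    my ($sum, $total) = (0, 0);
    for my $v (@values) { $sum += $v * $hFreq->{$v}; $total += $hFreq->{$v}; }
    my $mean  = $sum / $total;
    my $sizeB = $total - $sizeA;
    my ($best, $bestErr);

    my $recurse;
    $recurse = sub {
        my ($i, $countA, $sumA, $hTakeA) = @_;
        if ($i == @values) {
            return unless $countA == $sizeA;   # bucket A must be exactly full
            my $errA = abs($sumA / $sizeA - $mean);
            my $errB = abs(($sum - $sumA) / $sizeB - $mean);
            my $err  = $errA > $errB ? $errA : $errB;
            if (!defined $bestErr || $err < $bestErr) {
                ($best, $bestErr) = ({ %$hTakeA }, $err);
            }
            return;
        }
        my $v = $values[$i];
        for my $take (0 .. $hFreq->{$v}) {
            last if $countA + $take > $sizeA;  # bucket A would overflow
            local $hTakeA->{$v} = $take;
            $recurse->($i + 1, $countA + $take, $sumA + $v * $take, $hTakeA);
        }
    };
    $recurse->(0, 0, 0, {});
    return ($best, $bestErr);
}
```

    On GrandFather's data, `best_two_bucket_split(30, {1 => 40, 2 => 10, 4 => 10})` finds a split where both bucket averages equal the sample mean exactly, which is precisely the perfect result the heuristic misses.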

    Cheers - L~R

      That would be great! Feel free to borrow as needed (or not) from the code I contributed if that would save time.

      beth
        ELISHEVA,
        It would appear GrandFather has already provided an example.
        40 units at value of 1
        10 units at value of 2
        10 units at value of 4

        60 units total with a value of 100 and an average of 1 + 2/3

        Problem: Divide into 2 groups of 30

        Groups A & B both have 30 units averaging exactly 1.67 (1 + 2/3):
            20 units of 1
            5 units of 2
            5 units of 4
        Now, assuming you don't have any bugs in your code and GrandFather used it correctly, this shows that it does not always produce optimal results. Update: To give you an idea of how I was going to find such a scenario - see Re: Permutation of groups (and the root thread).

        Cheers - L~R