in reply to Average Price Algorithm

Note: Significantly updated, 2008-02-01, 12:30am ish

Mathematically this is a bit different from a packing problem. A packing problem is a least upper/greatest lower bound problem. This problem centers on distributions and deviations. It might be better to think of it as a matching problem.

To keep the bucket closest to the mean, whenever you give a bucket something with the price above the mean you also have to give it something below the mean. Originally, I thought this would be best accomplished with a two phase algorithm: first get rid of the items that let every bucket have the same average as the overall sample mean. Next get rid of the oddballs. That algorithm is described here:

I think if I were solving this problem, I might try something like this, for a start:

Phase I - for all items such that quantity > number of buckets

  1. pick the item at the mean (if it exists)
  2. allocate the same number of that item to each bucket and put aside the remainder
  3. pick the next pair above and below the mean. Calculate the average deviation of the values. If it exceeds the threshold above the mean, pick the next value below the mean. If it exceeds the threshold below the mean, pick the next value above the mean. Repeat until you find a combination of values above and below the mean that fall within the threshold.

    If your goal is minimizing distance from the mean rather than keeping that distance within a threshold, continue until you run out of values and pick the combination that had a minimal deviation.

  4. allocate the same number of each combination to each bucket and put aside any oddballs.
  5. repeat until you run out of things to allocate evenly to the buckets

At this point all should have exactly the same average and exactly the same deviation from the mean. If there were no left overs, then this would represent the best possible allocation. But of course, there will be left overs, so the next step is to get rid of the oddballs.

Phase II - get rid of odd balls

This is identical to phase I except that instead of allocating each combination to all buckets at once, each sub-threshold combination is given to only one bucket. So if choosing one value above and below the mean is below the threshold, then two items are allocated to bucket X and only bucket X. Then the whole process of finding a subthreshold combination is repeated for bucket Y, and so on until each bucket has been given a combination.

On the off chance that there is some cyclic pattern to the distribution of deviations, it is probably a good idea to use a random order when filling the buckets. So for the first round one might fill buckets 3-1-2. Next 2-1-3. Next 1-2-3. etc.

The other major risk in this algorithm is the "odd man out problem" - it may be that the deviations are distributed in such a way that the last bucket in the last round ends up with an average deviation that is really off the charts.

There are probably ways around this that involve calculating deviations of deviations or surveying the distribution of deviations - this kind of problem is more likely to happen in a badly skewed distribution. But this is beginning to spin my brain, so I'll stop.

Best, beth

Update: Whoops! Forget the second part of the problem: A gets 19, B gets 30 and so on. For that, you could simply start skipping a bucket when it reaches its max number of items. So if you start with 5 buckets and one gets filled, then from then on you work only with the four remaining buckets.

Note: the order in which people drop out can't be randomized so the person with the highest number of items is also at the greatest risk of ending up with the "odd man out" deviations. If, by any chance, you are allocating shares bought at different times for a group of investors this may or may not be "ok".

Translated into financial terms, he who has the greatest chance of being the holder of the "odd man out" has the greatest risk - i.e. the greatest liklihood that he or she has some significant upward or downward unpredictability in his/her outcome. Risk can have a monetary value and/or contractual constraints. If that is the case, you may need to figure out a way to determine the probability that someone will be the odd man out or else eliminate the issue by randomizing who is the odd man out, but as I said, that part of the problem is making my head spin...

Update again: reworded above paragraph.

I've since reconsidered my algorithm, see note below. Best, beth

Replies are listed 'Best First'.
Re^2: Average Price Algorithm
by ELISHEVA (Prior) on Jan 31, 2009 at 22:45 UTC

    Limbic-Region suggested I spend time comparing that algorithm with his (above) and after consideration I think there is a much better algorithm than the one I proposed above.

    Limbic-Region's algorithm I'm nearly sure is provably optimal, [update: provided it is adjusted to take into account relative factorization - see below]. It insures that the largest deviations always end up in the largest baskets where they are least likely to have a significant impact. Also his algorithm uses the values with the smallest deviation from the mean first. This insures that a bucket whose size prevents an above-mean deviation from being matched with a below-mean deviation will always have the smallest possible unmatched deviation, and hence the bucket average closest to the sample mean.

    On the other hand the algorithm proposed by Limbic-Region does not take full advantage of the histographic information so needs to do a binary search for each smallest deviation item, not to mention a lot of testing for best fit. This, I think can be avoided by ranking values by their distance from the mean before we begin filling the buckets.

    The code for a more histographic aware version of Limbic-Region's algorithm is posted below (without correction for relative factorization). The code has been tested on Debian-Etch/Perl 5.8. For demo purposes it prints out the results of allocating 6000 items into 3 buckets for three distributions: (a) a spike with all values are at the mean (b) a symmetrical distribution, i.e. no skew and (c) a seriously skewed distribution. I've also included a demo run for the original poster's distribution.

    BTW, the algorithm below is 0(N) where N=number of items to allocate to buckets. If I'm right about Limbic-Region's strategy being provably optimal, then this problem is very far from NP complete.

      The test sample:

      demoAllocation ( "Distribution: nasty", {a => 30, b => 30}, { '1.0' => 40, '2.0' => 10, '4.0' => 10, } );

      Prints:

      Distribution: nasty a: 17 @ $1.00 10 @ $2.00 3 @ $4.00 bucket avg: $1.63, deviation: $-0.033 b: 23 @ $1.00 7 @ $4.00 bucket avg: $1.70, deviation: $0.033

      However inspection of the original data suggests that simply allocating half of each original parcel to the two groups gives a perfect result.


      Perl's payment curve coincides with its learning curve.
        Many thanks for the example!

        GrandFather, I'm wondering if you would share your thoughts on

        • why this particular example fails to find the optimal solution?
        • whether there is a solution that would consistently address the reason for failure?

        It strikes me that it has something to do with relative factorization. For example, had the example been scaled down by 10 so that the number of buckets was relatively prime vis a vis the number of items assigned to each value, then the strategy of minimizing net deviation would have succeeded in finding the optimal result. Conversely, had the code I posted taken relative factorization into account, it would have found the optimal solution.

        Are there other examples you can think of, for which taking into account relative factorization would not help?

        Best, beth

      ELISHEVA,
      Thank you for taking the time to code this up. I agree, and pointed out, that there were optimizations to be made from my approach as I did it hastily. I am not at all convinced that it produces optimal results for all data sets. I think it is a good heuristic approach given the requirements (runtime of less than 3 seconds with small margin of error).

      In order to disprove that it doesn't always produce optimal results, I don't need the rigor of a formal proof - I just need to come up with an example. If you are interested in such an example, it shouldn't be too difficult. A small enough data set should allow all possible partitions and combinations which can then be compared against the results of the heuristic algorithm. If you want, I can code that up?

      Cheers - L~R

        That would be great! Feel free to borrow as needed (or not) from the code I contributed if that would save time.

        beth
Re^2: Average Price Algorithm
by ELISHEVA (Prior) on Feb 02, 2009 at 23:29 UTC
    I was sufficiently frustrated by the discussion in the note above, that I decided to just go ahead and correct the algorithm for situations where there are same sized buckets. The code posted below solves Grandfather's "Nasty distribution" perfectly. It is still O(N) where N is the number of items to allocate.

    The algorithm below is a hybrid of the one proposed by Limbic-Region and the one I originally proposed. Smallest buckets are filled first (as per Limbic-Region), but when we encounter N equal sized buckets we allocate items N (one per bucket) at a time until we can't. Then we go back to allocating items to buckets one by one. This prevents the first of N buckets from hogging all the "good" values.

    I'd still very much appreciate it if someone could find an example or two that doesn't work with this algorithm. Though I'm not convinced that this problem is anything like NP complete, I'm sure the code I've posted has room for improvement.

    Best, beth

    Update 1: fixed bug in code that was originally posted.

    Update 2: (Feb 3, 8:00 UTC) fixed bucket count bug identified below by Grandfather and Limbic~Region. I also made the output scream **ERROR** if this kind of mistake happens again. This is only a bug fix, however. The bucket counts are now right but the distributions discussed below - "Nasty mark II (Grandfather)" and "Limbic~Region" are still suboptimal.

      demoAllocation ( "Distribution: skewed: Nasty mark II (Grandfather)", {a => 30, b => 20, c => 10}, {'1.0' => 30, '2.0' => 12, '4.0' => 6, '8.0' => 12} );

      Prints:

      Distribution: skewed: Nasty mark II (Grandfather) a: 22 @ $1.00 8 @ $8.00 bucket avg: $2.87, deviation: $-0.033 b: 8 @ $1.00 6 @ $2.00 2 @ $4.00 4 @ $8.00 bucket avg: $3.00, deviation: $0.100 c: 6 @ $2.00 4 @ $4.00 bucket avg: $2.80, deviation: $-0.100

      However, allocating 1/2 of each parcel to a, 1/3 to b and 1/6 to c gives a perfect result.


      Perl's payment curve coincides with its learning curve.
      Many thanks again to Grandfather and Limbic~Region! Both for the bug find and the sub-optimal examples.

      Both examples are food for thought. Limbic-Region's distribution is interesting because it is a good example of where looking ahead only to the next largest deviation does not yield an optimal result. The smallest sized buckets can only be optimally filled by ignoring 6.0 even though it is smack on the mean (a deviation of 0). Grandfather's "nasty mark II" shows clearly that least(greatest?) common denominators (i.e. 2) and not just individual bucket size or frequency of bucket size affect the result.

      Best, beth
      ELISHEVA,
      demoAllocation ( "Distribution: Limbic~Region", {a => 3, b => 4, c => 2, d => 2}, {'1.0' => 1, '2.0' => 1, '3.0' => 1, '4.0' => 1, '5.0' => 1, '6.0' + => 1, '7.0' => 1, '8.0' => 1, '9.0' => 1, '10.0' => 1, '11.0' => 1 } ); __DATA__ Distribution: Limbic~Region a: 1 @ $2.00 1 @ $3.00 1 @ $9.00 bucket avg: $4.67, deviation: $-1.333 b: 1 @ $1.00 1 @ $10.00 1 @ $11.00 bucket avg: $7.33, deviation: $1.333 c: 1 @ $5.00 1 @ $6.00 1 @ $7.00 bucket avg: $6.00, deviation: $0.000 d: 1 @ $4.00 1 @ $8.00 bucket avg: $6.00, deviation: $0.000

      It is easy to show that this is not the best result. Consider a perfect distribution:

      • 1, 11, 6
      • 2, 10, 3, 9
      • 7, 5
      • 8, 4

      Cheers - L~R

        It is not just not a best result, it is an invalid result! b wanted 4 units but got 3 and c wanted 2 units but got 3.


        Perl's payment curve coincides with its learning curve.
Re^2: Average Price Algorithm
by ELISHEVA (Prior) on Feb 01, 2009 at 05:35 UTC
    Sorry - firefox double posted this.