in reply to Re: Need technique for generating constrained random data sets
in thread Need technique for generating constrained random data sets
This is similar in some respects to the algorithm I posted, but it seems to touch each number several times, adding a little bit of the remainder on each pass. I think that will tend to be somewhat slower than my algorithm, though it turns out to be stable without modification in the case where the constraints are all 50 +/- 50.
In some benchmarks I ran -- on a slightly cleaned-up version of my algorithm, admittedly, and incorporating the fix for pathological remainders -- I saw about a 10% gain over gen3 for the initial set of constraints. The difference was only about 5% for the pathological constraints case. (The original "gen" was substantially faster than both.)
So, Grandfather -- you've got a couple of workable options now. In the benchmarking, all three algorithms were run several hundred thousand times (and checked against the constraints each time). I'd suggest trying them out and confirming the statistical properties of the numbers you get, in case either of us introduced some bias unintentionally. (A rough checking harness is sketched after this post.)
-xdg
Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
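For reference, here is a rough sketch of the kind of check-and-exercise harness described above. This is an illustration based on the thread, not xdg's actual benchmark code; gen() and gen3() in the usage comment refer to the generators under discussion and are assumed to return a three-element list. Timing could be added with the standard Benchmark module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ranges for the OP's problem, as discussed in the thread.
my @ranges = ( [ 5, 35 ], [ 5, 55 ], [ 40, 60 ] );

# Return true if @vals respects every range and sums to 100.
sub check {
    my @vals = @_;
    my $sum  = 0;
    for my $i ( 0 .. $#ranges ) {
        my ( $lo, $hi ) = @{ $ranges[$i] };
        return 0 if $vals[$i] < $lo or $vals[$i] > $hi;
        $sum += $vals[$i];
    }
    return $sum == 100;
}

# Run a generator many times, dying on the first constraint violation.
sub exercise {
    my ( $name, $generator, $runs ) = @_;
    for ( 1 .. $runs ) {
        my @vals = $generator->();
        die "$name violated constraints: @vals\n" unless check(@vals);
    }
    print "$name: $runs runs, all within constraints\n";
}

# Usage -- gen() and gen3() are the thread's generators (not reproduced here):
# exercise( gen  => \&gen,  500_000 );
# exercise( gen3 => \&gen3, 500_000 );
```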
Replies are listed 'Best First'.
Re^3: Need technique for generating constrained random data sets
by BrowserUk (Patriarch) on Feb 08, 2007 at 15:37 UTC
The original "gen" was substantially faster than both. I mentioned that gen3 would be slower, but (I thought) fairer. This turns out not to be the case :( The basic algorithm is to allocated the minimum possible amount to each of the values. Subtract that from 100 to get ($remainder). Then allocate random amounts of that remainder to a random recipient until it's all gone. Skipping and trying again if the random allocated amount is greater than the randomly chosen recipient can accept. Whilst not actually determanistic--there is a random element after all :)--it converges to 0 reliably and very quickly always producing a correct solution. And all solutions are possible. The problem with the algorithm is that it is not fair. It favours some solutions over others. This table (split into 3), represents the results generated from a run producing 1e6 solutions to the OP problem. The columns represent the last variable (40-60 range). The rows are the first variable (5 - 35 range). Within the table are two values: the corresponding value for the third variable (5-55 range) on the left and the frequency that the 3-value combination was chosen.
As you can see, all 651 possible solutions were chosen, but their frequencies vary from a low of 23 to a high of 8819. Ideally this would be plotted as a 3D scatter plot so it could be visualised properly, but I don't have anything that will let me do that easily. And I couldn't post the results if I did :(

The reason the results are skewed is that when the initial random allocations of the remainder are made, the range of random values that can be chosen is larger than the residual range of any of the values, so the value with the largest residual range will be favoured.

The change for gen3() is to constrain the size of the random allocations from the remainder, at each iteration, to less than the smallest residual range after the minimum allocations. This means the algorithm iterates more times, but at each stage it is more likely that whatever value is randomly chosen will be able to accept the allocation--until that value reaches its upper bound. Unfortunately, a bug in my statistics-gathering routine meant that I saw what I was hoping to see. When I corrected that bug, it turned out that this slower gen3() algorithm does little better than gen().

In summary, gen3() produces nearly identical (unfair) results to gen(), so if the fairness is not a problem, gen() is the better choice. If fairness is important, my other post above may be worth considering. By producing all solutions and then picking one at random, you ensure fairness, but the price is that it gets very slow as the number of values and their ranges increases.

I keep thinking about, and have been pursuing, a solution that generates all the possible solutions without iterating over all the non-possibles and discarding them. I am really quite sure this is possible for a 3-value problem, and I think it should be extensible to N values, but I've always had trouble visualising things in more than 3 dimensions, and if I cannot visualise it, I have great trouble coding it.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
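For concreteness, here is a rough sketch of the allocation loop described above, based only on the description in this thread (the actual gen()/gen3() code was posted elsewhere and may differ). The $max_chunk cap is the gen3-style restriction; using the whole remainder instead gives the original gen()-style behaviour.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min sum);

# Sketch of the described algorithm: give every value its minimum, then
# hand out random chunks of the remainder to random recipients, skipping
# any allocation the chosen recipient cannot accept.
sub gen_sketch {
    my @ranges    = @_;                          # e.g. ( [5,35], [5,55], [40,60] )
    my @vals      = map { $_->[0] } @ranges;     # start at the minimums
    my $remainder = 100 - sum @vals;

    # gen3-style cap: never hand out more than the smallest residual range.
    my $max_chunk = min map { $_->[1] - $_->[0] } @ranges;
    # For gen()-style behaviour, use the whole remainder instead:
    # my $max_chunk = $remainder;

    while ( $remainder > 0 ) {
        my $chunk = 1 + int rand min( $max_chunk, $remainder );
        my $i     = int rand @ranges;
        # Skip and retry if this recipient can't accept that much.
        next if $vals[$i] + $chunk > $ranges[$i][1];
        $vals[$i]  += $chunk;
        $remainder -= $chunk;
    }
    return @vals;
}

my @ranges = ( [ 5, 35 ], [ 5, 55 ], [ 40, 60 ] );
print join( ' ', gen_sketch(@ranges) ), "\n" for 1 .. 10;
```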
by xdg (Monsignor) on Feb 08, 2007 at 18:31 UTC
I'm not really sure that there is a clear sense of what "fairness" means here. These variables are negatively correlated -- if one of them is high in its range, the others have to be lower in their ranges so the total still sums to 100. For example, when the sum of the midpoints exceeds 100, there will be more combinations that work when the variables are below their midpoints.

Put differently, one definition of fairness is that all potential combinations are equally likely. Another definition would take into account the implications of the overlapping ranges. Our algorithms perform quite differently as a result. Here are the means for each variable from our two algorithms (a sketch of how such means can be gathered follows this post):
-xdg
Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
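As an aside, here is a minimal sketch of how such per-variable means can be gathered over many runs. This is an illustration only; the calling convention for the thread's gen()/gen3() is assumed in the usage comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Average each variable over many generated solutions.
# $generator is a code ref returning one list of values per call.
sub variable_means {
    my ( $generator, $runs ) = @_;
    my @totals;
    for ( 1 .. $runs ) {
        my @vals = $generator->();
        $totals[$_] = ( $totals[$_] || 0 ) + $vals[$_] for 0 .. $#vals;
    }
    return map { $_ / $runs } @totals;
}

# Usage, with gen() standing in for either of the thread's generators:
# my @means = variable_means( sub { gen( [5,35], [5,55], [40,60] ) }, 100_000 );
# printf "means: %.2f  %.2f  %.2f\n", @means;
```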
by BrowserUk (Patriarch) on Feb 08, 2007 at 18:57 UTC
Take the example of two 6-sided dice. If we set a limit of 10 for the sum, then some rolls will be rejected, which I think is analogous to this problem. Accumulating the frequency of each combination while simply rejecting those over our limit still gives a flat distribution (a quick simulation of this follows this post).
Whether a flat distribution is correct for this application is GrandFather's call. But even if some other distribution (Poisson or whatever) is required, it would be much easier to arrive at it once you have a generator that produces a known distribution. I'm also unsure how you would extend those other distributions to 4 or more dimensions.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
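To make the dice analogy concrete, here is a small rejection-sampling sketch (my own illustration, not code from the thread): roll two dice, reject any roll whose sum exceeds 10, and tally how often each surviving (d1, d2) combination occurs. The accepted combinations come out roughly equally frequent.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Roll two 6-sided dice, reject rolls whose sum exceeds the limit,
# and tally the frequency of each accepted (d1, d2) combination.
my $limit = 10;
my %freq;

for ( 1 .. 1_000_000 ) {
    my $d1 = 1 + int rand 6;
    my $d2 = 1 + int rand 6;
    next if $d1 + $d2 > $limit;    # reject over-limit rolls
    $freq{"$d1,$d2"}++;
}

# Each of the 33 accepted combinations should appear with roughly
# the same frequency (about 1/33 of the accepted rolls).
for my $combo ( sort keys %freq ) {
    printf "%s => %d\n", $combo, $freq{$combo};
}
```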
by xdg (Monsignor) on Feb 08, 2007 at 22:47 UTC
by GrandFather (Saint) on Feb 09, 2007 at 00:17 UTC