in reply to Re: Need technique for generating constrained random data sets
in thread Need technique for generating constrained random data sets
This is similar in some respects to the algorithm I posted, but it seems to touch each number several times, adding a little bit of the remainder on each pass. I think that will tend to be somewhat slower than my algorithm, though it turns out to be stable without modification in the case where the constraints are all 50 +/- 50.
In some benchmarks I ran -- on a slightly cleaned-up version of my algorithm, admittedly, and incorporating the fix for pathological remainders -- I saw about a 10% gain over gen3 for the initial set of constraints. The difference was only about 5% for the pathological constraints case. (The original "gen" was substantially faster than both.)
So, Grandfather -- you've got a couple of workable options now. In the benchmarking, all three algorithms were run several hundred thousand times (and checked against the constraints each time). I'd suggest trying them out and confirming the statistical properties of the numbers you get, in case either of us introduced some bias unintentionally. (A rough checking harness is sketched after this post.)
-xdg
Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
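For reference, here is a rough sketch of the kind of check-and-exercise harness described above. This is an illustration based on the thread, not xdg's actual benchmark code; gen() and gen3() in the usage comment refer to the generators under discussion and are assumed to return a three-element list. Timing could be added with the standard Benchmark module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ranges for the OP's problem, as discussed in the thread.
my @ranges = ( [ 5, 35 ], [ 5, 55 ], [ 40, 60 ] );

# Return true if @vals respects every range and sums to 100.
sub check {
    my @vals = @_;
    my $sum  = 0;
    for my $i ( 0 .. $#ranges ) {
        my ( $lo, $hi ) = @{ $ranges[$i] };
        return 0 if $vals[$i] < $lo or $vals[$i] > $hi;
        $sum += $vals[$i];
    }
    return $sum == 100;
}

# Run a generator many times, dying on the first constraint violation.
sub exercise {
    my ( $name, $generator, $runs ) = @_;
    for ( 1 .. $runs ) {
        my @vals = $generator->();
        die "$name violated constraints: @vals\n" unless check(@vals);
    }
    print "$name: $runs runs, all within constraints\n";
}

# Usage -- gen() and gen3() are the thread's generators (not reproduced here):
# exercise( gen  => \&gen,  500_000 );
# exercise( gen3 => \&gen3, 500_000 );
```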
Replies are listed 'Best First'.
Re^3: Need technique for generating constrained random data sets
by BrowserUk (Patriarch) on Feb 08, 2007 at 15:37 UTC
The original "gen" was substantially faster than both. I mentioned that gen3 would be slower, but (I thought) fairer. This turns out not to be the case :( The basic algorithm is to allocated the minimum possible amount to each of the values. Subtract that from 100 to get ($remainder). Then allocate random amounts of that remainder to a random recipient until it's all gone. Skipping and trying again if the random allocated amount is greater than the randomly chosen recipient can accept. Whilst not actually determanistic--there is a random element after all :)--it converges to 0 reliably and very quickly always producing a correct solution. And all solutions are possible. The problem with the algorithm is that it is not fair. It favours some solutions over others. This table (split into 3), represents the results generated from a run producing 1e6 solutions to the OP problem. The columns represent the last variable (40-60 range). The rows are the first variable (5 - 35 range). Within the table are two values: the corresponding value for the third variable (5-55 range) on the left and the frequency that the 3-value combination was chosen.
As you can see, all 651 possible solutions were chosen, but their frequencies vary from a low of 23 to a high of 8819. Ideally this would be plotted as a 3D scatter plot so it could be visualised properly, but I don't have anything that will let me do that easily. And I couldn't post the results if I did :(

The reason the results are skewed is that when the initial random allocations of the remainder are made, the range of random values that can be chosen is larger than the residual range of any of the values, so the value with the largest residual range will be favoured.

The change for gen3() is to constrain the size of the random allocations from the remainder, at each iteration, to less than the smallest residual range after the minimum allocations. This means the algorithm iterates more times, but at each stage it is more likely that whatever value is randomly chosen will be able to accept the allocation--until that value reaches its upper bound. Unfortunately, a bug in my statistics-gathering routine meant that I saw what I was hoping to see. When I corrected that bug, it turned out that this slower gen3() algorithm does little better than gen().

In summary, gen3() produces nearly identical (unfair) results to gen(), so if the fairness is not a problem, gen() is the better choice. If fairness is important, my other post above may be worth considering. By producing all solutions and then picking one at random, you ensure fairness, but the price is that it gets very slow as the number of values and their ranges increases.

I keep thinking about, and have been pursuing, a solution that generates all the possible solutions without iterating over all the non-possibles and discarding them. I am really quite sure this is possible for a 3-value problem, and I think it should be extensible to N values, but I've always had trouble visualising things in more than 3 dimensions, and if I cannot visualise it, I have great trouble coding it.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
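For concreteness, here is a rough sketch of the allocation loop described above, based only on the description in this thread (the actual gen()/gen3() code was posted elsewhere and may differ). The $max_chunk cap is the gen3-style restriction; using the whole remainder instead gives the original gen()-style behaviour.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(min sum);

# Sketch of the described algorithm: give every value its minimum, then
# hand out random chunks of the remainder to random recipients, skipping
# any allocation the chosen recipient cannot accept.
sub gen_sketch {
    my @ranges    = @_;                          # e.g. ( [5,35], [5,55], [40,60] )
    my @vals      = map { $_->[0] } @ranges;     # start at the minimums
    my $remainder = 100 - sum @vals;

    # gen3-style cap: never hand out more than the smallest residual range.
    my $max_chunk = min map { $_->[1] - $_->[0] } @ranges;
    # For gen()-style behaviour, use the whole remainder instead:
    # my $max_chunk = $remainder;

    while ( $remainder > 0 ) {
        my $chunk = 1 + int rand min( $max_chunk, $remainder );
        my $i     = int rand @ranges;
        # Skip and retry if this recipient can't accept that much.
        next if $vals[$i] + $chunk > $ranges[$i][1];
        $vals[$i]  += $chunk;
        $remainder -= $chunk;
    }
    return @vals;
}

my @ranges = ( [ 5, 35 ], [ 5, 55 ], [ 40, 60 ] );
print join( ' ', gen_sketch(@ranges) ), "\n" for 1 .. 10;
```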
by xdg (Monsignor) on Feb 08, 2007 at 18:31 UTC
I'm not really sure that there is a clear sense of what "fairness" means here. These variables are negatively correlated -- if one of them is high in its range, the others have to be lower in their ranges so the total still sums to 100. For example, when the sum of the midpoints exceeds 100, there will be more combinations that work when the variables are below their midpoints.

Put differently, one definition of fairness is that all potential combinations are equally likely. Another definition would take into account the implications of the overlapping ranges. Our algorithms perform quite differently as a result. Here are the means for each variable from our two algorithms (a sketch of how such means can be gathered follows this post):
-xdg
Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
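As an aside, here is a minimal sketch of how such per-variable means can be gathered over many runs. This is an illustration only; the calling convention for the thread's gen()/gen3() is assumed in the usage comment.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Average each variable over many generated solutions.
# $generator is a code ref returning one list of values per call.
sub variable_means {
    my ( $generator, $runs ) = @_;
    my @totals;
    for ( 1 .. $runs ) {
        my @vals = $generator->();
        $totals[$_] = ( $totals[$_] || 0 ) + $vals[$_] for 0 .. $#vals;
    }
    return map { $_ / $runs } @totals;
}

# Usage, with gen() standing in for either of the thread's generators:
# my @means = variable_means( sub { gen( [5,35], [5,55], [40,60] ) }, 100_000 );
# printf "means: %.2f  %.2f  %.2f\n", @means;
```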
by BrowserUk (Patriarch) on Feb 08, 2007 at 18:57 UTC
Take the example of two 6-sided dice. If we set a limit of 10 for the sum, then some rolls will be rejected, which I think is analogous to this problem. Accumulating the frequency of each combination while simply rejecting those over our limit still gives a flat distribution (a quick simulation of this follows this post).
Whether a flat distribution is correct for this application is GrandFather's call. But even if some other distribution (Poisson or whatever) is required, it would be much easier to arrive at it once you have a generator that produces a known distribution. I'm also unsure how you would extend those other distributions to 4 or more dimensions.

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
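To make the dice analogy concrete, here is a small rejection-sampling sketch (my own illustration, not code from the thread): roll two dice, reject any roll whose sum exceeds 10, and tally how often each surviving (d1, d2) combination occurs. The accepted combinations come out roughly equally frequent.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Roll two 6-sided dice, reject rolls whose sum exceeds the limit,
# and tally the frequency of each accepted (d1, d2) combination.
my $limit = 10;
my %freq;

for ( 1 .. 1_000_000 ) {
    my $d1 = 1 + int rand 6;
    my $d2 = 1 + int rand 6;
    next if $d1 + $d2 > $limit;    # reject over-limit rolls
    $freq{"$d1,$d2"}++;
}

# Each of the 33 accepted combinations should appear with roughly
# the same frequency (about 1/33 of the accepted rolls).
for my $combo ( sort keys %freq ) {
    printf "%s => %d\n", $combo, $freq{$combo};
}
```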
by xdg (Monsignor) on Feb 08, 2007 at 22:47 UTC
by GrandFather (Saint) on Feb 09, 2007 at 00:17 UTC