in reply to Need technique for generating constrained random data sets

This is nearly determanistic and fast, and I believe all possible outcomes are possible (d'uh!:), and I don't think that there is any particular bias. I'm not sure about the fairness though.

#! perl -slw use strict; use List::Util qw[ sum ]; sub gen { my $constraints = shift; my @rands = map $_->{ mid } - $_->{ sd }, @$constraints; my @limits = map $_->{ mid } + $_->{ sd }, @$constraints; my $remainder = 100 - sum @rands; while( $remainder ) { my $target = int rand @rands; my $addition = 1+ int rand $remainder; next if $rands[ $target ] + $addition > $limits[ $target ]; $rands[ $target ] += $addition; $remainder -= $addition; } return @rands; } my @constraints = ( { mid => 20, sd => 15 }, { mid => 30, sd => 25 }, { mid => 50, sd => 10 }, ); print join ' ', gen( \@constraints ) for 1 .. 20; @constraints = ( { mid => 10, sd => 5 }, { mid => 10, sd => 5 }, { mid => 10, sd => 5 }, { mid => 10, sd => 5 }, { mid => 10, sd => 5 }, { mid => 10, sd => 5 }, { mid => 10, sd => 5 }, { mid => 10, sd => 5 }, { mid => 10, sd => 5 }, { mid => 10, sd => 5 }, ); print join ' ', gen( \@constraints ) for 1 .. 20; __END__ C:\test>598736-2.pl | more 12 47 41 13 29 58 25 35 40 27 33 40 11 33 56 15 27 58 25 35 40 32 18 50 34 23 43 14 35 51 24 16 60 5 43 52 13 47 40 29 14 57 33 8 59 23 37 40 29 11 60 7 33 60 5 41 54 33 9 58 10 10 5 8 12 11 5 15 12 12 7 5 11 14 15 7 6 9 11 15 5 5 9 13 14 7 11 12 10 14 14 13 8 14 9 5 12 5 14 6 13 5 14 15 10 5 15 9 9 5 5 10 5 14 13 11 15 13 5 9 5 10 13 14 14 5 13 12 5 9 12 9 14 11 7 14 8 15 5 5 14 14 5 14 5 14 8 5 12 9 15 14 12 15 7 5 10 8 9 5 14 5 12 10 15 14 5 5 9 11 12 12 10 9 9 6 15 8 14 5 11 15 9 13 10 6 15 6 5 10 15 12 14 8 5 10 11 5 15 5 5 13 14 6 14 12 12 14 5 5 14 5 12 14 15 8 5 7 15 5 14 8 14 13 15 5 5 14 5 7 13 7 8 5 13 10 13 13 13 5 14 15 5 7 14 7 13 14 6 5 14 15 15 8 13 6 7 7 5 10

Update: This is slightly slower, but appears to produce fairer results:

sub gen3 { my $constraints = shift; my @rands = map $_->{ mid } - $_->{ sd }, @$constraints; my @limits = map $_->{ mid } + $_->{ sd }, @$constraints; my $maxChange = min map $_->{ sd }, @$constraints; my $remainder = 100 - sum @rands; while( $remainder ) { my $target = int rand @rands; my $addition = 1+ int rand min( $remainder, $maxChange ); next if $rands[ $target ] + $addition > $limits[ $target ]; $rands[ $target ] += $addition; $remainder -= $addition; } return @rands; }

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Replies are listed 'Best First'.
Re^2: Need technique for generating constrained random data sets
by xdg (Monsignor) on Feb 08, 2007 at 13:21 UTC

    This is similar in some respect to the algorithm I posted, but seems to touch each number several times, adding a little bit of the remainder. I think that will tend to be somewhat slower than my algorithm, though it turns out to be stable without modification in the case where constraints are all 50 +/- 50.

    In some benchmarks I ran -- on slightly cleaned up version of my algorithm, admittedly, and incorporating the fix for pathological remainders -- I saw about a 10% gain over gen3 for the initial case of constraints. The difference was only about 5% for the pathological constraints case. (The original "gen" was substantially faster than both.)

    So, Grandfather -- you've got a couple of workable options now. In the benchmarking, all three algorithms were run several hundred thousand times (and checked against constraints each time). I'd suggesting trying them out and confirming the statistical properties of the numbers you get in case either of us introduced some bias unintentionally.

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

      The original "gen" was substantially faster than both.

      I mentioned that gen3 would be slower, but (I thought) fairer. This turns out not to be the case :(

      The basic algorithm is to allocated the minimum possible amount to each of the values. Subtract that from 100 to get ($remainder). Then allocate random amounts of that remainder to a random recipient until it's all gone. Skipping and trying again if the random allocated amount is greater than the randomly chosen recipient can accept.

      Whilst not actually determanistic--there is a random element after all :)--it converges to 0 reliably and very quickly always producing a correct solution. And all solutions are possible.

      The problem with the algorithm is that it is not fair. It favours some solutions over others.

      This table (split into 3), represents the results generated from a run producing 1e6 solutions to the OP problem.

      The columns represent the last variable (40-60 range). The rows are the first variable (5 - 35 range). Within the table are two values: the corresponding value for the third variable (5-55 range) on the left and the frequency that the 3-value combination was chosen.

      |40 |41 |42 |43 |44 |45 |46 --+-------+-------+-------+-------+-------+-------+------- 35|25 1388|24 1106|23 1153|22 1259|21 1441|20 1632|19 1726 34|26 1067|25 734|24 811|23 903|22 1021|21 1214|20 1384 33|27 978|26 598|25 690|24 785|23 901|22 1057|21 1201 32|28 964|27 618|26 663|25 753|24 913|23 1062|22 1176 31|29 891|28 588|27 609|26 676|25 794|24 1011|23 1123 30|30 865|29 531|28 579|27 667|26 798|25 934|24 1180 29|31 847|30 497|29 518|28 665|27 743|26 967|25 1054 28|32 790|31 506|30 511|29 614|28 745|27 866|26 1021 27|33 763|32 475|31 568|30 615|29 743|28 850|27 1031 26|34 691|33 470|32 514|31 611|30 724|29 847|28 971 25|35 735|34 452|33 468|32 585|31 683|30 874|29 962 24|36 709|35 416|34 485|33 590|32 686|31 836|30 999 23|37 616|36 403|35 430|34 526|33 667|32 778|31 953 22|38 581|37 340|36 379|35 477|34 588|33 735|32 931 21|39 500|38 335|37 400|36 447|35 572|34 672|33 867 20|40 467|39 306|38 340|37 457|36 536|35 640|34 765 19|41 412|40 268|39 280|38 354|37 443|36 582|35 663 18|42 392|41 223|40 258|39 324|38 403|37 476|36 636 17|43 335|42 231|41 216|40 304|39 340|38 485|37 558 16|44 279|43 173|42 213|41 297|40 329|39 413|38 499 15|45 315|44 185|43 209|42 258|41 314|40 437|39 521 14|46 222|45 125|44 175|43 210|42 254|41 312|40 407 13|47 186|46 104|45 125|44 179|43 208|42 298|41 320 12|48 130|47 76|46 92|45 129|44 153|43 206|42 257 11|49 99|48 63|47 80|46 98|45 120|44 139|43 186 10|50 64|49 57|48 59|47 64|46 98|45 108|44 125 9|51 50|50 36|49 47|48 55|47 77|46 88|45 129 8|52 55|51 27|50 38|49 67|48 68|47 79|46 106 7|53 47|52 25|51 25|50 38|49 44|48 70|47 81 6|54 39|53 24|52 23|51 38|50 31|49 69|48 80 5|55 52|54 38|53 36|52 40|51 65|50 77|49 130 |47 |48 |49 |50 | 51 |52 |53 --+-------+-------+-------+-------+-------+-------+------- 35|18 1986|17 2270|16 2448|15 2983|14 2530|13 2305|12 2140 34|19 1478|18 1668|17 1842|16 2025|15 2056|14 1908|13 1870 33|20 1366|19 1512|18 1682|17 1832|16 1820|15 1984|14 1915 32|21 1316|20 1531|19 1730|18 1821|17 1721|16 1681|15 2021 31|22 1376|21 1490|20 1658|19 1818|18 1700|17 1800|16 1870 30|23 1312|22 1535|21 1749|20 1824|19 1767|18 1808|17 1887 29|24 1229|23 1548|22 1657|21 1914|20 1809|19 1845|18 2009 28|25 1209|24 1431|23 1730|22 1917|21 1815|20 1937|19 1954 27|26 1207|25 1448|24 1750|23 2051|22 1807|21 1970|20 2104 26|27 1175|26 1409|25 1707|24 2047|23 1864|22 2035|21 2148 25|28 1242|27 1385|26 1702|25 2048|24 1944|23 2179|22 2237 24|29 1200|28 1343|27 1665|26 1953|25 1935|24 2100|23 2379 23|30 1163|29 1407|28 1676|27 1890|26 1836|25 2116|24 2343 22|31 1126|30 1270|29 1565|28 1851|27 1815|26 2019|25 2339 21|32 1007|31 1292|30 1507|29 1810|28 1679|27 2006|26 2100 20|33 927|32 1157|31 1425|30 1710|29 1622|28 1877|27 1989 19|34 885|33 973|32 1259|31 1550|30 1524|29 1752|28 2002 18|35 741|34 952|33 1163|32 1426|31 1461|30 1610|29 1786 17|36 722|35 878|34 1052|33 1280|32 1387|31 1450|30 1596 16|37 627|36 783|35 980|34 1172|33 1202|32 1312|31 1542 15|38 598|37 787|36 959|35 1203|34 1226|33 1401|32 1643 14|39 509|38 632|37 834|36 999|35 1000|34 1222|33 1337 13|40 442|39 560|38 648|37 784|36 787|35 921|34 1163 12|41 320|40 399|39 501|38 665|37 597|36 728|35 928 11|42 243|41 365|40 409|39 540|38 540|37 601|36 765 10|43 176|42 254|41 320|40 403|39 409|38 458|37 587 9|44 179|43 192|42 270|41 316|40 312|39 408|38 449 8|45 122|44 160|43 209|42 267|41 239|40 321|39 341 7|46 103|45 115|44 160|43 223|42 198|41 268|40 309 6|47 99|46 108|45 125|44 183|43 181|42 229|41 291 5|48 121|47 168|46 254|45 279|44 311|43 355|42 419 |54 |55 |56 |57 |58 |59 |60 --+-------+-------+-------+-------+-------+-------+------- 35|11 2090|10 2029| 9 2069| 8 1841| 7 1980| 6 2147| 5 3417 34|12 1789|11 1705|10 1584| 9 1560| 8 1646| 7 1790| 6 2455 33|13 1855|12 1712|11 1671|10 1737| 9 1742| 8 1841| 7 2437 32|14 1988|13 1969|12 1874|11 1964|10 1924| 9 2123| 8 2773 31|15 2259|14 2159|13 2160|12 2117|11 2231|10 2354| 9 3236 30|16 2051|15 2514|14 2414|13 2437|12 2578|11 2794|10 3621 29|17 2158|16 2417|15 2834|14 2782|13 2793|12 3173|11 4410 28|18 2182|17 2378|16 2613|15 3274|14 3446|13 3719|12 5159 27|19 2236|18 2357|17 2662|16 3145|15 3728|14 4315|13 5951 26|20 2309|19 2608|18 2787|17 3257|16 3715|15 4947|14 6915 25|21 2520|20 2721|19 3060|18 3294|17 3918|16 4862|15 8045 24|22 2579|21 2896|20 3257|19 3517|18 4172|17 5088|16 8058 23|23 2749|22 2956|21 3296|20 3840|19 4333|18 5447|17 8236 22|24 2623|23 2950|22 3389|21 3942|20 4579|19 5580|18 8478 21|25 2460|24 2903|23 3312|22 3697|21 4437|20 5712|19 8578 20|26 2301|25 2785|24 3189|23 3637|22 4433|21 5524|20 8819 19|27 2210|26 2541|25 3122|24 3636|23 4537|22 5554|21 8553 18|28 2027|27 2375|26 2730|25 3306|24 4288|23 5477|22 8472 17|29 1963|28 2210|27 2726|26 3177|25 3975|24 5142|23 8239 16|30 1845|29 2213|28 2564|27 2969|26 3685|25 4958|24 7985 15|31 1904|30 2142|29 2639|28 3139|27 3925|26 5044|25 7934 14|32 1579|31 1757|30 2209|29 2650|28 3260|27 4230|26 6789 13|33 1250|32 1564|31 1833|30 2170|29 2773|28 3596|27 5750 12|34 1106|33 1243|32 1555|31 1853|30 2454|29 3094|28 4962 11|35 861|34 1026|33 1208|32 1543|31 1876|30 2565|29 4114 10|36 694|35 849|34 1030|33 1266|32 1631|31 2101|30 3474 9|37 561|36 625|35 804|34 968|33 1289|32 1802|31 2889 8|38 455|37 516|36 689|35 792|34 1032|33 1400|32 2416 7|39 310|38 448|37 554|36 629|35 845|34 1190|33 2073 6|40 328|39 402|38 445|37 596|36 794|35 1003|34 1755 5|41 475|40 599|39 703|38 918|37 1180|36 1447|35 2061 651 selections were made

      As you can see, all 651 possible solutions were chosen, but their frequencies vary from as low as 23 to a high of 8819. Ideally this would be plotted as a 3D scatter plot to allow it to be visuallised correctly, but I don't have anything that will allow me to do that easily. And I couldn't post the results if I did :(

      The reason the results are skewed is that when the initial random allocation of the remainder are made, the range of random values that can be chosen is larger than the residual range of any of the values, so the largest residual will be favoured.

      The change for gen3() is to constrain the size of the random allocations from the remainder at each iteration to a size smaller than the smallest residual after their minimum allocations. This means that algorithm interates more times, but at each stage, it is more likely that whatever value is randomly chosen will be able to accept the allocation--until you reach the point where that value has reached it's upper bound.

      Unfortunately, a bug in my statistics gathering routine meant that I saw what I was hoping to see. When I correct that bug, it turns out that this slower gen3() algorithm does little better than gen().

      In summary, gen3() produces nearly identical (unfair) results to gen(), so if the fairness is not a problem, gen() is the better choice.

      If fairness is important, my other post above may be worth considering. By producing all solutions and the picking one at random, you ensure fairness, but the price is that it gets very slow as the number of values and their ranges increase. I keep thinking, and have been pursuing a solution that generates all the possible solutions without iterating all the non-possibles and discarding them.

      I am really quite sure that this is possible for a 3-value problem. And I think it should be extensible to N-values, but I've always had trouble visualising things in greater than 3 dimensions and if I cannot visualise it, I have great trouble coding it.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        I'm not really sure that there is a clear sense of what "fairness" means here. These variable are negatively correlated -- if one if them is high in its range, the others have to be lower in their ranges so it all sums to 100.

        For example, when the sum of midpoints exceeds 100, there will be more combinations that work when variables are below their midpoints.

        Put differently, one definition of fairness is that all potential combinations are equally likely. Another definition would take into account the implications of the overlapping ranges.

        Our algorithms perform quite differently as a result. Here are the means for each variable from our two algorithms:

        # Original constraints (xdg algorithm) Expected: 20.0 30.0 50.0 Means 19.5 30.5 50.0 # Pathological constraints (xdg algorithm) Expected: 50.0 50.0 50.0 Means 49.5 24.8 25.7 # Original constraints (gen3 algorithm) Expected: 20.0 30.0 50.0 Means 22.6 23.2 54.2 # Pathological constraints (gen3 algorithm) Expected: 50.0 50.0 50.0 Means: 33.5 33.4 33.1

        -xdg

        Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.