GrandFather has asked for the wisdom of the Perl Monks concerning the following question:

In some simulation code I am working on I need to generate a random set of percentages for various components. The sum of the individual percentages for the components must be 100, but there are constraints on the minimum and maximum percentage for each component.

The current code for generating simulation sets looks somewhat like that shown below.

use strict;
use warnings;

my @constraints = (
    {mid => 20, sd => 15},
    {mid => 30, sd => 25},
    {mid => 50, sd => 10},
);

my $sum;
for my $cnst (@constraints) {
    my $value = RandFlat ($cnst->{mid}, $cnst->{sd});
    $cnst->{value} = $value;
    $sum += $value;
}

# scale so that percentages add up to 100
# this may push parameters outside bounds!
my $scale = 100.0 / $sum;
for my $cnst (@constraints) {
    my ($mid, $sd, $value) = @{$cnst}{'mid', 'sd', 'value'};
    $value = $value * $scale;
    printf "%5.1f +-%5.1f: %5.1f", $mid, $sd, $value;
    print " - bad" if ($value < ($mid - $sd)) || ($value > ($mid + $sd));
    print "\n";
}

sub RandFlat {
    # Return a rand () value with a flat distribution about the $mean +- $stdDev
    my ($mean, $stdDev) = @_;
    my $range = 2.0 * $stdDev;
    my $value = rand ($range);
    return $value + ($mean - $stdDev);
}

Prints:

 20.0 +- 15.0:  22.7
 30.0 +- 25.0:  40.1
 50.0 +- 10.0:  37.2 - bad

However, this technique can generate sets that do not conform to the constraints (as shown). Depending on the actual constraints, the probability of generating a compliant data set may be rather low! Is there a better technique for solving this problem?
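
For comparison, the simplest correct fallback is rejection sampling: keep the draw-and-scale step, but throw the whole set away and retry whenever any scaled value breaks its bounds. A minimal sketch (the helper names are illustrative, not from the code above):

use strict;
use warnings;

# Flat draw within [mid - sd, mid + sd], as in RandFlat above.
sub rand_flat {
    my ($mid, $sd) = @_;
    return $mid - $sd + rand( 2 * $sd );
}

# Draw a set, scale it to sum to 100, and return it only if every
# scaled value still satisfies its constraint; otherwise return ().
sub try_scaled_set {
    my @constraints = @_;
    my @value = map { rand_flat( $_->{mid}, $_->{sd} ) } @constraints;
    my $sum = 0;
    $sum += $_ for @value;
    my $scale = 100 / $sum;
    $_ *= $scale for @value;
    for my $i ( 0 .. $#constraints ) {
        my ($mid, $sd) = @{ $constraints[$i] }{qw( mid sd )};
        return () if $value[$i] < $mid - $sd or $value[$i] > $mid + $sd;
    }
    return @value;
}

my @constraints = ( {mid => 20, sd => 15}, {mid => 30, sd => 25}, {mid => 50, sd => 10} );
my @set;
@set = try_scaled_set( @constraints ) until @set;
printf "%5.1f %5.1f %5.1f\n", @set;

Depending on the constraints this can loop many times before accepting a set, which is exactly the problem the replies below address.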


DWIM is Perl's answer to Gödel

Re: Need technique for generating constrained random data sets
by BrowserUk (Patriarch) on Feb 07, 2007 at 13:21 UTC

    As you are going to need millions of datasets, it may be better, and will certainly give you a fairer random selection, to generate all the possible compliant sets and then choose one of them at random each time you need one.

    #! perl -slw
    use strict;
    use List::Util qw[ sum ];

    sub Nfor (&@) {
        my( $code, @ranges ) = @_;
        my @indices = ( 0 ) x @ranges;
        $indices[ $#indices ]--;
        my $tumbler = $#indices;
        my @returns;
        while( 1 ) {
            if( ++$indices[ $tumbler ] > $#{ $ranges[ $tumbler ] } ) {
                $indices[ $tumbler ] = 0;
                last if --$tumbler < 0;
                next;
            }
            local $SIG{__WARN__} = sub{ warn @_ unless "@_" =~ /Exiting/ };
            if( defined wantarray ) {
                push @returns, $code->(
                    map $ranges[ $_ ][ $indices[ $_ ] ], 0 .. $#indices
                );
            }
            else {
                () = $code->( map $ranges[ $_ ][ $indices[ $_ ] ], 0 .. $#indices );
            }
            $tumbler = $#indices;
        }
        defined wantarray ? wantarray ? @returns : \@returns : ();
    }

    sub genPossibles{
        my $constraints = shift;
        my @possibles;
        Nfor{
            my $sum = sum( @_ );
            next if $sum > 100;
            push @possibles, \@_ if $sum == 100;
        }
        map { [ $_->{ mid } - $_->{ sd } .. $_->{ mid } + $_->{ sd } ] }
            @$constraints;
        return \@possibles;
    }

    my @constraints = (
        { mid => 20, sd => 15 },
        { mid => 30, sd => 25 },
        { mid => 50, sd => 10 },
    );

    my $possibles = genPossibles( \@constraints );
    printf "There are %d possible selections\n", scalar @$possibles;
    print "Here is one of them: @{ @{ $possibles }[ rand @$possibles ] }";
    __END__
    C:\test>junk7
    There are 651 possible selections
    Here is one of them: 20 39 41

    C:\test>junk7
    There are 651 possible selections
    Here is one of them: 27 32 41

    C:\test>junk7
    There are 651 possible selections
    Here is one of them: 17 29 54

    My Nfor() sub is a not-well-tested attempt at an Algorithm::Loops::NestedLoops() act-alike, but you could use that module instead (roughly as sketched below).
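
    For reference, here's roughly how genPossibles() might look using that module instead (an untested sketch, assuming Algorithm::Loops is installed):

    use strict;
    use warnings;
    use List::Util qw[ sum ];
    use Algorithm::Loops qw[ NestedLoops ];

    sub genPossibles {
        my $constraints = shift;
        my @possibles;
        # one nested loop per constraint, over its min .. max range
        NestedLoops(
            [ map { [ $_->{ mid } - $_->{ sd } .. $_->{ mid } + $_->{ sd } ] }
                @$constraints ],
            sub { push @possibles, [ @_ ] if sum( @_ ) == 100 },
        );
        return \@possibles;
    }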


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Need technique for generating constrained random data sets
by polettix (Vicar) on Feb 07, 2007 at 12:40 UTC
    First of all, it seems quite strange that the mean values for each item do not add up to 100% - are you sure about your model?

    Second, if you happen to have N random variables with a constraint, you simply happen to have only (N-1) random variables. If the sum of N items must add up to 100, you can only choose (N-1) of them, otherwise you'll end up breaking the constraint with - ehr - probability 1.

    Of course, nobody is saying that you must always choose the same set of (N-1) "free" variables! For each iteration, I would suggest a three-step process, in which you:

    • choose the (N-1) variables you want to randomize
    • draw random values for these (N-1) variables
    • calculate the value for the remaining variable based on the constraint

    An iteration must be rejected if the (N-1) values do not allow the correct generation of the Nth, of course. This would happen if the (N-1) values add up to more than 100, for example; the probability of this happening is somewhat correlated with the variance of the (N-1) variables. In your case, this rejection is easily spotted, because it would lead to a negative value for the N-th percentage.
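
    A minimal sketch of this scheme, assuming the same mid/sd constraint hashes as in the original post (try_set() is an illustrative name, not code from the thread):

    use strict;
    use warnings;
    use List::Util qw( shuffle );

    # Draw (N-1) variables in a random order, derive the Nth from the
    # sum constraint, and return () to signal a rejected iteration.
    sub try_set {
        my @constraints = @_;
        my @order = shuffle 0 .. $#constraints;
        my $last  = pop @order;              # this one is derived, not drawn
        my @value;
        my $sum = 0;
        for my $i (@order) {
            my ($mid, $sd) = @{ $constraints[$i] }{qw( mid sd )};
            $value[$i] = $mid - $sd + rand( 2 * $sd );
            $sum += $value[$i];
        }
        my ($mid, $sd) = @{ $constraints[$last] }{qw( mid sd )};
        my $rest = 100 - $sum;               # forced value for the Nth variable
        return () if $rest < $mid - $sd or $rest > $mid + $sd;
        $value[$last] = $rest;
        return @value;
    }

    my @constraints = ( {mid => 20, sd => 15}, {mid => 30, sd => 25}, {mid => 50, sd => 10} );
    my @set;
    @set = try_set( @constraints ) until @set;
    printf "%5.1f %5.1f %5.1f\n", @set;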

    The above process should also address the "mean values do not add up to 100" issue, even if I would repeat my suggestion that you verify your model on this point.

    Flavio
    perl -ple'$_=reverse' <<<ti.xittelop@oivalf

    Don't fool yourself.

      The backstory is that this is from some inherited code that tries to guess what diet might generate a particular isotope balance measured in bones from an archeological dig. The current technique identifies seven food groups that alter the isotope balance of three elements in different ways. There is not enough information to determine the diet directly from the isotope information so some other technique is required.

      The technique adopted by the software I'm working on generates random diets comprising some proportion of each dietary element. It then calculates the expected isotope balance each diet would generate and records information for those that generate a balance close to the target balance.

      The search space can be narrowed somewhat if you can provide constraints on the proportion of the diet that can be contributed by each component. For example, it is not possible to live on a diet of shellfish alone, so a limit can be placed on the maximum contribution that shellfish could make to the diet without causing death in short order (about as long as eating nothing at all, actually!).

      So the constraints represent the allowable range for the proportion each dietary element may contribute to the total diet. Obviously, for any particular diet, the total of the contributing components is 100%.

      A simple set of constraints would be to set the mean for each element to 50 and the range to +- 50 - that is, allow any value. In that case the sum of the means would be 350. Completely legitimate in the context of the problem space, but it doesn't constrain the search much!


      DWIM is Perl's answer to Gödel
Re: Need technique for generating constrained random data sets
by Limbic~Region (Chancellor) on Feb 07, 2007 at 11:50 UTC
Re: Need technique for generating constrained random data sets
by herveus (Prior) on Feb 07, 2007 at 13:08 UTC
    Howdy!

    When a trial set is bad, either you need to toss it entirely and start over, or you bring the "bad" value into range by taking from values that are in range. This gets more interesting if more than one value is out of range. If you need four values, and the third can't satisfy its constraints, the fourth won't have any of the pie left for it.

    Hmmm... I'm visualizing a pie divider that has n cutters, with the spacing between each cutter constrained to x +/- y.

    I'm too lazy right now to work up any code fragments.

    I'm not visualizing an approach that will reliably produce valid values without some sort of iteration, either by tossing entire sets that fail, or by adjusting the values. If you pick values for each component without regard to the total and then normalize them so the sum is 100, you will get fewer bad sets, but I can see how normalizing could push a value near the limit out of bounds.

     20.0 +- 15.0:  22.7
     30.0 +- 25.0:  40.1
     50.0 +- 10.0:  37.2 - bad
    1. pick values, say (22.7, 40.1, 57.3) - sum = 120.1
    2. normalize by multiplying by 100/120.1 -> (18.9, 33.4, 47.7) - sum = 100 -> happiness
    3. pick values, say (32.7, 54.3, 42.9) -> sum = 129.9
    4. normalize by multiplying by 100/129.9 -> (25.2, 41.8, 33.0) - sum = 100 but third number too small -> not happiness
    5. pin the out-of-range value to its lower limit -> (25.2, 41.8, 40) - sum = 107
    6. renormalize by multiplying values in range by 60/67 (their share/their sum) -> (22.6, 37.4, 40) - sum = 100 -> happiness

    Yeah, it's iterative, but so long as the constraints allow a result, it will converge. It's that mechanical pie divider thingy. Sometimes, one of the dividers runs up against its stops and becomes pinned.
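
    Something along these lines, perhaps (a rough, untested sketch of the normalise-then-pin loop; the min/max/value/pinned field names are made up for illustration, and it assumes the free values stay positive):

    use strict;
    use warnings;

    # Each entry: { min => ..., max => ..., value => ... }. Scale the
    # unpinned values to fill whatever the pinned ones leave of 100,
    # pin anything pushed out of range, and repeat until stable.
    sub pin_and_scale {
        my @c = @_;
        while (1) {
            my @free = grep { !$_->{pinned} } @c;
            last unless @free;
            my $pinned_sum = 0;
            $pinned_sum += $_->{value} for grep { $_->{pinned} } @c;
            my $free_sum = 0;
            $free_sum += $_->{value} for @free;
            my $scale   = ( 100 - $pinned_sum ) / $free_sum;
            my $changed = 0;
            for my $v (@free) {
                $v->{value} *= $scale;
                if ( $v->{value} < $v->{min} ) {
                    @{$v}{qw( value pinned )} = ( $v->{min}, 1 );
                    $changed = 1;
                }
                elsif ( $v->{value} > $v->{max} ) {
                    @{$v}{qw( value pinned )} = ( $v->{max}, 1 );
                    $changed = 1;
                }
            }
            last unless $changed;   # all in range and summing to 100
        }
        return map { $_->{value} } @c;
    }

    my @parts = pin_and_scale(
        { min =>  5, max => 35, value => 32.7 },
        { min =>  5, max => 55, value => 54.3 },
        { min => 40, max => 60, value => 42.9 },
    );
    printf "%.1f %.1f %.1f\n", @parts;   # 22.6 37.4 40.0, as in the worked example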

    yours,
    Michael

      Sadly, pinning values does not lead to happiness because it mucks up the distribution of values - the pinned values are likely to be vastly over-represented.


      DWIM is Perl's answer to Gödel
Re: Need technique for generating constrained random data sets
by kyle (Abbot) on Feb 07, 2007 at 12:35 UTC
    use strict;
    use warnings;

    my @constraints = (
        {mid => 20, sd => 15},
        {mid => 30, sd => 25},
        {mid => 50, sd => 10},
    );

    my $sum;
    for my $cnst (@constraints) {
        my $value = RandFlat ($cnst->{mid}, $cnst->{sd});
        $cnst->{value} = $value;
        $sum += $value;
    }

    # want $diff = 0
    my $diff = $sum - 100;
    my $adj_sum = 0;

    # positive $diff means $sum is too high, go lower
    if ( $diff > 0 ) {
        for my $cnst (@constraints) {
            # how far can this value be adjusted?
            my $adj = $cnst->{value} - ($cnst->{mid} - $cnst->{sd});
            $cnst->{adj} = $adj;
            $adj_sum += $adj;
        }
    }
    # negative $diff means $sum is too low, go higher
    elsif ( $diff < 0 ) {
        for my $cnst (@constraints) {
            # how far can this value be adjusted?
            my $adj = $cnst->{mid} + $cnst->{sd} - $cnst->{value};
            $cnst->{adj} = $adj;
            $adj_sum += $adj;
        }
    }

    # possible adjustment vs. adjustment needed
    # (guard against division by zero when $sum is already exactly 100)
    if ( $diff ) {
        my $scale = $adj_sum / $diff;
        for my $cnst (@constraints) {
            $cnst->{value} -= $cnst->{adj} / $scale;
        }
    }

    $sum = 0;
    for my $cnst (@constraints) {
        my ($mid, $sd, $value) = @{$cnst}{'mid', 'sd', 'value'};
        printf "%5.1f +-%5.1f: %5.1f", $mid, $sd, $value;
        print " - bad" if ($value < ($mid - $sd)) || ($value > ($mid + $sd));
        print "\n";
        $sum += $value;
    }
    print "sum: $sum\n";

    sub RandFlat {
        # Return a rand () value with a flat distribution about the $mean +- $stdDev
        my ($mean, $stdDev) = @_;
        my $range = 2.0 * $stdDev;
        my $value = rand ($range);
        return $value + ($mean - $stdDev);
    }
Re: Need technique for generating constrained random data sets
by xdg (Monsignor) on Feb 07, 2007 at 15:09 UTC
    my @constraints = (
        {mid => 20, sd => 15},
        {mid => 30, sd => 25},
        {mid => 50, sd => 10},
    );

    As others have implied, I don't think you're going to be able to do this deterministically, as you have over-specified your solution set. The sum of your maximums is greater than 100 and the sum of your minimums is less than 100.

    What comes to mind to minimize the number of discarded sets and avoid scaling is to pick the numbers in descending order of midpoint. After generating each number, check if the remainder is more than the sum of the minimums remaining and less than the sum of the maximums remaining. If it's outside the bounds, repick the last number. When you're down to the last number, you don't generate it randomly, it's just the remainder.

    Basically, you're pruning off choices that can't satisfy the remaining constraints.

    Without thinking it through further, I'm worried that doing it in descending order of midpoint might bias the results. I think you could pick them in random order and you're just more likely to have to repick numbers along the way.

    Update: Here's a code sample:

    use strict;
    use warnings;

    sub RandFlat {
        # Return a rand () value with a flat distribution about the $mean +- $stdDev
        my ($mean, $stdDev) = @_;
        my $range = 2.0 * $stdDev;
        my $value = rand ($range);
        return $value + ($mean - $stdDev);
    }

    sub generate_set {
        my ($remainder, @constraints) = @_;
        my @results;
        my ($sum_of_minima, $sum_of_maxima);
        for my $c ( @constraints ) {
            $sum_of_minima += $c->{mid} - $c->{sd};
            $sum_of_maxima += $c->{mid} + $c->{sd};
        }

        # iterate through N-1 constraints in descending order
        my @descending = sort { $b->{mid} <=> $a->{mid} } @constraints;
        my $last_value = pop @descending;

        for my $c ( @descending ) {
            my $n = RandFlat( $c->{mid}, $c->{sd} );

            # repeat if remainder outside sum of the remaining
            # minima and maxima constraints
            my $new_remainder     = $remainder - $n;
            my $new_sum_of_minima = $sum_of_minima - ( $c->{mid} - $c->{sd} );
            my $new_sum_of_maxima = $sum_of_maxima - ( $c->{mid} + $c->{sd} );
            redo if ( $new_remainder < $new_sum_of_minima )
                 || ( $new_remainder > $new_sum_of_maxima );

            # otherwise save number and update the remainder and constraints
            push @results, [ $c->{mid}, $c->{sd}, $n ];
            $remainder     = $new_remainder;
            $sum_of_minima = $new_sum_of_minima;
            $sum_of_maxima = $new_sum_of_maxima;
        }

        # the remainder must now satisfy the final constraint
        return @results, [ $last_value->{mid}, $last_value->{sd}, $remainder ];
    }

    my $total = 100;
    my @constraints = (
        {mid => 20, sd => 15},
        {mid => 30, sd => 25},
        {mid => 50, sd => 10},
    );

    for ( 1 .. 5 ) {
        for my $result ( generate_set($total, @constraints) ) {
            my ($mid, $sd, $value) = @$result;
            printf "%5.1f +-%5.1f: %5.1f", $mid, $sd, $value;
            print " - bad" if ($value < ($mid - $sd)) || ($value > ($mid + $sd));
            print "\n";
        }
        print "\n";
    }

    I ran it 1000 times and didn't see any bad results.

    Also, thinking it through again, I don't think the descending order will wind up biased (but I could be convinced by a good argument).

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

      Some such approach as this had occurred to me already, but I couldn't get my head around implementing it. The comment regarding the "over-specified solution set" is helpful in my realising why I'm having trouble with this.

      It's not clear to me that your code would be happy with a set of constraints such as:

      my @constraints = (
          {mid => 50, sd => 50},
          {mid => 50, sd => 50},
          {mid => 50, sd => 50},
      );

      See Re^2: Need technique for generating constrained random data sets for some of the background.


      DWIM is Perl's answer to Gödel

        I just tried it and it works fine:

         50.0 +- 50.0:  43.1
         50.0 +- 50.0:  18.8
         50.0 +- 50.0:  38.2

         50.0 +- 50.0:  32.8
         50.0 +- 50.0:  13.4
         50.0 +- 50.0:  53.8

         50.0 +- 50.0:  67.6
         50.0 +- 50.0:  25.2
         50.0 +- 50.0:   7.2

         50.0 +- 50.0:  76.4
         50.0 +- 50.0:  10.9
         50.0 +- 50.0:  12.7

         50.0 +- 50.0:  89.9
         50.0 +- 50.0:   3.1
         50.0 +- 50.0:   7.0

        There is a risk of getting trapped and not converging on an answer. E.g. if the first number is 99, the second number will still be generated randomly between 0 and 100, and will be thrown out repeatedly unless it's less than 1. Ditto for the third number, but in an even tighter range.

        However, I think that can be avoided by constraining each random number to be in the range from (mid - sd) to min(mid + sd, remainder), as sketched below. In the example above, the second number would be chosen between 0 and 1. Assuming that was 0.6, the third number would then be chosen between 0 and 0.4. That would be guaranteed to converge quickly.
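
        In code, that might look like the following (a sketch of a clamped replacement for RandFlat() inside generate_set()'s loop; the name is illustrative):

        use List::Util qw( min );

        # Draw flat between the constraint's minimum and the smaller of
        # its maximum and the amount still left to allocate. Assumes
        # $remainder >= the minimum, which generate_set()'s feasibility
        # check already guarantees.
        sub RandFlatClamped {
            my ($mean, $stdDev, $remainder) = @_;
            my $lo = $mean - $stdDev;
            my $hi = min( $mean + $stdDev, $remainder );
            return $lo + rand( $hi - $lo );
        }

        # ... and in the loop:
        # my $n = RandFlatClamped( $c->{mid}, $c->{sd}, $remainder );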

        -xdg

        Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Need technique for generating constrained random data sets
by BrowserUk (Patriarch) on Feb 07, 2007 at 15:50 UTC

    This is nearly deterministic and fast, and I believe all possible outcomes are possible (d'uh! :), and I don't think that there is any particular bias. I'm not sure about the fairness, though.

    #! perl -slw
    use strict;
    use List::Util qw[ sum ];

    sub gen {
        my $constraints = shift;
        my @rands  = map $_->{ mid } - $_->{ sd }, @$constraints;
        my @limits = map $_->{ mid } + $_->{ sd }, @$constraints;
        my $remainder = 100 - sum @rands;
        while( $remainder ) {
            my $target   = int rand @rands;
            my $addition = 1 + int rand $remainder;
            next if $rands[ $target ] + $addition > $limits[ $target ];
            $rands[ $target ] += $addition;
            $remainder       -= $addition;
        }
        return @rands;
    }

    my @constraints = (
        { mid => 20, sd => 15 },
        { mid => 30, sd => 25 },
        { mid => 50, sd => 10 },
    );
    print join ' ', gen( \@constraints ) for 1 .. 20;

    @constraints = (
        { mid => 10, sd => 5 }, { mid => 10, sd => 5 },
        { mid => 10, sd => 5 }, { mid => 10, sd => 5 },
        { mid => 10, sd => 5 }, { mid => 10, sd => 5 },
        { mid => 10, sd => 5 }, { mid => 10, sd => 5 },
        { mid => 10, sd => 5 }, { mid => 10, sd => 5 },
    );
    print join ' ', gen( \@constraints ) for 1 .. 20;
    __END__
    C:\test>598736-2.pl | more
    12 47 41
    13 29 58
    25 35 40
    27 33 40
    11 33 56
    15 27 58
    25 35 40
    32 18 50
    34 23 43
    14 35 51
    24 16 60
    5 43 52
    13 47 40
    29 14 57
    33 8 59
    23 37 40
    29 11 60
    7 33 60
    5 41 54
    33 9 58
    10 10 5 8 12 11 5 15 12 12
    7 5 11 14 15 7 6 9 11 15
    5 5 9 13 14 7 11 12 10 14
    14 13 8 14 9 5 12 5 14 6
    13 5 14 15 10 5 15 9 9 5
    5 10 5 14 13 11 15 13 5 9
    5 10 13 14 14 5 13 12 5 9
    12 9 14 11 7 14 8 15 5 5
    14 14 5 14 5 14 8 5 12 9
    15 14 12 15 7 5 10 8 9 5
    14 5 12 10 15 14 5 5 9 11
    12 12 10 9 9 6 15 8 14 5
    11 15 9 13 10 6 15 6 5 10
    15 12 14 8 5 10 11 5 15 5
    5 13 14 6 14 12 12 14 5 5
    14 5 12 14 15 8 5 7 15 5
    14 8 14 13 15 5 5 14 5 7
    13 7 8 5 13 10 13 13 13 5
    14 15 5 7 14 7 13 14 6 5
    14 15 15 8 13 6 7 7 5 10

    Update: This is slightly slower, but appears to produce fairer results:

    # Note: this needs min() as well as sum(): use List::Util qw[ sum min ];
    sub gen3 {
        my $constraints = shift;
        my @rands  = map $_->{ mid } - $_->{ sd }, @$constraints;
        my @limits = map $_->{ mid } + $_->{ sd }, @$constraints;
        my $maxChange = min map $_->{ sd }, @$constraints;
        my $remainder = 100 - sum @rands;
        while( $remainder ) {
            my $target   = int rand @rands;
            my $addition = 1 + int rand min( $remainder, $maxChange );
            next if $rands[ $target ] + $addition > $limits[ $target ];
            $rands[ $target ] += $addition;
            $remainder       -= $addition;
        }
        return @rands;
    }

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      This is similar in some respects to the algorithm I posted, but it seems to touch each number several times, adding a little bit of the remainder each time. I think that will tend to be somewhat slower than my algorithm, though it turns out to be stable without modification in the case where the constraints are all 50 +/- 50.

      In some benchmarks I ran -- on a slightly cleaned-up version of my algorithm, admittedly, and incorporating the fix for pathological remainders -- I saw about a 10% gain over gen3 for the initial case of constraints. The difference was only about 5% for the pathological constraints case. (The original "gen" was substantially faster than both.)

      So, GrandFather -- you've got a couple of workable options now. In the benchmarking, all three algorithms were run several hundred thousand times (and checked against the constraints each time). I'd suggest trying them out and confirming the statistical properties of the numbers you get, in case either of us introduced some bias unintentionally.
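
      For anyone wanting to reproduce the comparison, a harness along these lines would do (assuming gen(), gen3() and generate_set() from this thread are already defined; the posted percentages came from my own runs, not from this exact snippet):

      use Benchmark qw( cmpthese );

      my @constraints = (
          {mid => 20, sd => 15}, {mid => 30, sd => 25}, {mid => 50, sd => 10},
      );
      # gen(), gen3() and generate_set() as posted elsewhere in this thread
      cmpthese( -5, {
          gen          => sub { my @s = gen( \@constraints ) },
          gen3         => sub { my @s = gen3( \@constraints ) },
          generate_set => sub { my @s = generate_set( 100, @constraints ) },
      } );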

      -xdg

      Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

        The original "gen" was substantially faster than both.

        I mentioned that gen3 would be slower, but (I thought) fairer. This turns out not to be the case :(

        The basic algorithm is to allocate the minimum possible amount to each of the values and subtract that total from 100 to get the remainder ($remainder). Then allocate random amounts of that remainder to randomly chosen recipients until it is all gone, skipping and trying again whenever the randomly allocated amount is more than the randomly chosen recipient can accept.

        Whilst not actually deterministic--there is a random element, after all :)--it converges to 0 reliably and very quickly, always producing a correct solution. And all solutions are possible.

        The problem with the algorithm is that it is not fair. It favours some solutions over others.

        This table (split into three parts) represents the results generated from a run producing 1e6 solutions to the OP's problem.

        The columns represent the last variable (40-60 range). The rows are the first variable (5-35 range). Within the table are two values: the corresponding value for the third variable (5-55 range) on the left, and the frequency with which that 3-value combination was chosen.

          |40     |41     |42     |43     |44     |45     |46
        --+-------+-------+-------+-------+-------+-------+-------
        35|25 1388|24 1106|23 1153|22 1259|21 1441|20 1632|19 1726
        34|26 1067|25  734|24  811|23  903|22 1021|21 1214|20 1384
        33|27  978|26  598|25  690|24  785|23  901|22 1057|21 1201
        32|28  964|27  618|26  663|25  753|24  913|23 1062|22 1176
        31|29  891|28  588|27  609|26  676|25  794|24 1011|23 1123
        30|30  865|29  531|28  579|27  667|26  798|25  934|24 1180
        29|31  847|30  497|29  518|28  665|27  743|26  967|25 1054
        28|32  790|31  506|30  511|29  614|28  745|27  866|26 1021
        27|33  763|32  475|31  568|30  615|29  743|28  850|27 1031
        26|34  691|33  470|32  514|31  611|30  724|29  847|28  971
        25|35  735|34  452|33  468|32  585|31  683|30  874|29  962
        24|36  709|35  416|34  485|33  590|32  686|31  836|30  999
        23|37  616|36  403|35  430|34  526|33  667|32  778|31  953
        22|38  581|37  340|36  379|35  477|34  588|33  735|32  931
        21|39  500|38  335|37  400|36  447|35  572|34  672|33  867
        20|40  467|39  306|38  340|37  457|36  536|35  640|34  765
        19|41  412|40  268|39  280|38  354|37  443|36  582|35  663
        18|42  392|41  223|40  258|39  324|38  403|37  476|36  636
        17|43  335|42  231|41  216|40  304|39  340|38  485|37  558
        16|44  279|43  173|42  213|41  297|40  329|39  413|38  499
        15|45  315|44  185|43  209|42  258|41  314|40  437|39  521
        14|46  222|45  125|44  175|43  210|42  254|41  312|40  407
        13|47  186|46  104|45  125|44  179|43  208|42  298|41  320
        12|48  130|47   76|46   92|45  129|44  153|43  206|42  257
        11|49   99|48   63|47   80|46   98|45  120|44  139|43  186
        10|50   64|49   57|48   59|47   64|46   98|45  108|44  125
         9|51   50|50   36|49   47|48   55|47   77|46   88|45  129
         8|52   55|51   27|50   38|49   67|48   68|47   79|46  106
         7|53   47|52   25|51   25|50   38|49   44|48   70|47   81
         6|54   39|53   24|52   23|51   38|50   31|49   69|48   80
         5|55   52|54   38|53   36|52   40|51   65|50   77|49  130

          |47     |48     |49     |50     |51     |52     |53
        --+-------+-------+-------+-------+-------+-------+-------
        35|18 1986|17 2270|16 2448|15 2983|14 2530|13 2305|12 2140
        34|19 1478|18 1668|17 1842|16 2025|15 2056|14 1908|13 1870
        33|20 1366|19 1512|18 1682|17 1832|16 1820|15 1984|14 1915
        32|21 1316|20 1531|19 1730|18 1821|17 1721|16 1681|15 2021
        31|22 1376|21 1490|20 1658|19 1818|18 1700|17 1800|16 1870
        30|23 1312|22 1535|21 1749|20 1824|19 1767|18 1808|17 1887
        29|24 1229|23 1548|22 1657|21 1914|20 1809|19 1845|18 2009
        28|25 1209|24 1431|23 1730|22 1917|21 1815|20 1937|19 1954
        27|26 1207|25 1448|24 1750|23 2051|22 1807|21 1970|20 2104
        26|27 1175|26 1409|25 1707|24 2047|23 1864|22 2035|21 2148
        25|28 1242|27 1385|26 1702|25 2048|24 1944|23 2179|22 2237
        24|29 1200|28 1343|27 1665|26 1953|25 1935|24 2100|23 2379
        23|30 1163|29 1407|28 1676|27 1890|26 1836|25 2116|24 2343
        22|31 1126|30 1270|29 1565|28 1851|27 1815|26 2019|25 2339
        21|32 1007|31 1292|30 1507|29 1810|28 1679|27 2006|26 2100
        20|33  927|32 1157|31 1425|30 1710|29 1622|28 1877|27 1989
        19|34  885|33  973|32 1259|31 1550|30 1524|29 1752|28 2002
        18|35  741|34  952|33 1163|32 1426|31 1461|30 1610|29 1786
        17|36  722|35  878|34 1052|33 1280|32 1387|31 1450|30 1596
        16|37  627|36  783|35  980|34 1172|33 1202|32 1312|31 1542
        15|38  598|37  787|36  959|35 1203|34 1226|33 1401|32 1643
        14|39  509|38  632|37  834|36  999|35 1000|34 1222|33 1337
        13|40  442|39  560|38  648|37  784|36  787|35  921|34 1163
        12|41  320|40  399|39  501|38  665|37  597|36  728|35  928
        11|42  243|41  365|40  409|39  540|38  540|37  601|36  765
        10|43  176|42  254|41  320|40  403|39  409|38  458|37  587
         9|44  179|43  192|42  270|41  316|40  312|39  408|38  449
         8|45  122|44  160|43  209|42  267|41  239|40  321|39  341
         7|46  103|45  115|44  160|43  223|42  198|41  268|40  309
         6|47   99|46  108|45  125|44  183|43  181|42  229|41  291
         5|48  121|47  168|46  254|45  279|44  311|43  355|42  419

          |54     |55     |56     |57     |58     |59     |60
        --+-------+-------+-------+-------+-------+-------+-------
        35|11 2090|10 2029| 9 2069| 8 1841| 7 1980| 6 2147| 5 3417
        34|12 1789|11 1705|10 1584| 9 1560| 8 1646| 7 1790| 6 2455
        33|13 1855|12 1712|11 1671|10 1737| 9 1742| 8 1841| 7 2437
        32|14 1988|13 1969|12 1874|11 1964|10 1924| 9 2123| 8 2773
        31|15 2259|14 2159|13 2160|12 2117|11 2231|10 2354| 9 3236
        30|16 2051|15 2514|14 2414|13 2437|12 2578|11 2794|10 3621
        29|17 2158|16 2417|15 2834|14 2782|13 2793|12 3173|11 4410
        28|18 2182|17 2378|16 2613|15 3274|14 3446|13 3719|12 5159
        27|19 2236|18 2357|17 2662|16 3145|15 3728|14 4315|13 5951
        26|20 2309|19 2608|18 2787|17 3257|16 3715|15 4947|14 6915
        25|21 2520|20 2721|19 3060|18 3294|17 3918|16 4862|15 8045
        24|22 2579|21 2896|20 3257|19 3517|18 4172|17 5088|16 8058
        23|23 2749|22 2956|21 3296|20 3840|19 4333|18 5447|17 8236
        22|24 2623|23 2950|22 3389|21 3942|20 4579|19 5580|18 8478
        21|25 2460|24 2903|23 3312|22 3697|21 4437|20 5712|19 8578
        20|26 2301|25 2785|24 3189|23 3637|22 4433|21 5524|20 8819
        19|27 2210|26 2541|25 3122|24 3636|23 4537|22 5554|21 8553
        18|28 2027|27 2375|26 2730|25 3306|24 4288|23 5477|22 8472
        17|29 1963|28 2210|27 2726|26 3177|25 3975|24 5142|23 8239
        16|30 1845|29 2213|28 2564|27 2969|26 3685|25 4958|24 7985
        15|31 1904|30 2142|29 2639|28 3139|27 3925|26 5044|25 7934
        14|32 1579|31 1757|30 2209|29 2650|28 3260|27 4230|26 6789
        13|33 1250|32 1564|31 1833|30 2170|29 2773|28 3596|27 5750
        12|34 1106|33 1243|32 1555|31 1853|30 2454|29 3094|28 4962
        11|35  861|34 1026|33 1208|32 1543|31 1876|30 2565|29 4114
        10|36  694|35  849|34 1030|33 1266|32 1631|31 2101|30 3474
         9|37  561|36  625|35  804|34  968|33 1289|32 1802|31 2889
         8|38  455|37  516|36  689|35  792|34 1032|33 1400|32 2416
         7|39  310|38  448|37  554|36  629|35  845|34 1190|33 2073
         6|40  328|39  402|38  445|37  596|36  794|35 1003|34 1755
         5|41  475|40  599|39  703|38  918|37 1180|36 1447|35 2061

        651 selections were made

        As you can see, all 651 possible solutions were chosen, but their frequencies vary from a low of 23 to a high of 8819. Ideally this would be plotted as a 3D scatter plot to allow it to be visualised correctly, but I don't have anything that will let me do that easily. And I couldn't post the results if I did :(

        The reason the results are skewed is that when the initial random allocations from the remainder are made, the range of random values that can be chosen is larger than the residual range of any of the values, so the largest residual will be favoured.

        The change for gen3() is to constrain the size of the random allocations from the remainder, at each iteration, to a size smaller than the smallest residual after the minimum allocations. This means that the algorithm iterates more times, but at each stage it is more likely that whatever value is randomly chosen will be able to accept the allocation--until that value reaches its upper bound.

        Unfortunately, a bug in my statistics-gathering routine meant that I saw what I was hoping to see. When I corrected that bug, it turned out that this slower gen3() algorithm does little better than gen().

        In summary, gen3() produces nearly identical (unfair) results to gen(), so if fairness is not a problem, gen() is the better choice.

        If fairness is important, my other post above may be worth considering. By producing all solutions and then picking one at random, you ensure fairness, but the price is that it gets very slow as the number of values and their ranges increase. I keep thinking about, and have been pursuing, a solution that generates all the possible solutions without iterating over all the non-possibles and discarding them.

        I am really quite sure that this is possible for a 3-value problem, and I think it should be extensible to N values, but I've always had trouble visualising things in more than 3 dimensions--and if I cannot visualise it, I have great trouble coding it.
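
        For what it's worth, here is a speculative sketch of such a pruned enumeration for the general N-value case: each value's range is clamped so that the remaining values can always reach the remaining total, so no non-solutions are ever visited (untested beyond the OP's case):

        use strict;
        use warnings;
        use List::Util qw( sum );

        sub enumerate {
            my ($total, $bounds, $prefix, $out) = @_;
            unless ( @$bounds ) {
                push @$out, [ @$prefix ] if $total == 0;
                return;
            }
            my ($first, @rest) = @$bounds;
            my $min_rest = sum( 0, map { $_->[0] } @rest );
            my $max_rest = sum( 0, map { $_->[1] } @rest );
            # clamp so the rest can still make up the difference
            my $lo = $total - $max_rest > $first->[0] ? $total - $max_rest : $first->[0];
            my $hi = $total - $min_rest < $first->[1] ? $total - $min_rest : $first->[1];
            enumerate( $total - $_, \@rest, [ @$prefix, $_ ], $out ) for $lo .. $hi;
        }

        my @bounds = ( [ 5, 35 ], [ 5, 55 ], [ 40, 60 ] );  # the OP's mid +/- sd ranges
        my @solutions;
        enumerate( 100, \@bounds, [], \@solutions );
        print scalar @solutions, " solutions\n";            # prints 651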


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Need technique for generating constrained random data sets
by CountZero (Bishop) on Feb 07, 2007 at 16:34 UTC
    GrandFather,

    Your use of sd seems to indicate you are thinking of Standard Deviation for your boundaries.

    Know then that a Standard Deviation is not a hard boundary and that it is perfectly OK for an individual value to be outside the mean plus or minus the Standard Deviation.

    Chebyshev showed that at least 50% of the values in your set will be within 1.4 (more precisely, the square root of 2) standard deviations of the mean. As a corollary, this means that up to 50% of your values may be more than 1.4 Standard Deviations away from your mean. It probably also means that you cannot have a flat distribution for your data if you have to simulate a certain Standard Deviation.
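
    In symbols, the form of Chebyshev's inequality used here is:

        \Pr\bigl( \lvert X - \mu \rvert \ge k\sigma \bigr) \le \frac{1}{k^{2}},
        \qquad k = \sqrt{2} \approx 1.414 \;\Rightarrow\; \Pr \le \tfrac{1}{2}

    i.e. at most half of any distribution's mass can lie more than about 1.4 standard deviations from its mean.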

    But then again perhaps you did not think of Standard Deviation at all when asking this question!

    CountZero

    "If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law

      I suspect that the original idea was to use a normal distribution, but that the switch to a flat distribution was made because of the boundaries at 0 and 100.

      An alternative would be to use a bounded probability distribution like the Beta distribution or the Kumaraswamy distribution.

      I keep meaning to implement Kumaraswamy in Math::Random::OO one of these days. (Right after I fix it to use Math::Random::MT::Auto as the underlying generator and recode other transformations in XS.)
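
      For the curious, the Kumaraswamy(a, b) distribution is particularly easy to sample because its CDF, F(x) = 1 - (1 - x**a)**b on (0, 1), inverts in closed form. A minimal illustrative sketch (not Math::Random::OO code):

      use strict;
      use warnings;

      # Inverse-CDF sampling: solve u = 1 - (1 - x**a)**b for x.
      sub rand_kumaraswamy {
          my ($a, $b) = @_;
          my $u = rand();
          return ( 1 - ( 1 - $u )**( 1 / $b ) )**( 1 / $a );
      }

      print rand_kumaraswamy( 2, 3 ), "\n" for 1 .. 5;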

      -xdg

      Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

      See Re^2: Need technique for generating constrained random data sets for the back story. The code does include an (unused) provision for using a normal distribution. However, given the nature of the search and the uncertainties in at least some elements of the diet, a flat distribution is probably more appropriate than a normal distribution - not my call, however.


      DWIM is Perl's answer to Gödel
Re: Need technique for generating constrained random data sets
by Moron (Curate) on Feb 08, 2007 at 11:04 UTC
    The term "random" is often not well-defined and the addition of a probability density function can complete the picture.

    For example, it could be (I can't properly identify your model from the info so far) that the constraints should be achieved by shifting and scaling a bell curve so that it intersects y=0 exactly at the constraint points, or it might be that a Poisson distribution is required.

    Most people will by default interpret "random" as having a flat probability density function, but that can often be unsuitable for statistical models.

    Update: If it is "flat", however, you could generate the data in two steps: pick a constraint set at random and then use the constraints just chosen to randomly generate a "point", e.g.

    my @limits = (
        [10, 40],
        [20, 70],
    );

    # pick a constraint set at random...
    # (note: 'use integer' is deliberately omitted, as it would truncate
    # the rand() arithmetic below)
    my $pick1 = int rand @limits;

    # ...then a flat random value within that set's bounds
    my $pick2 = $limits[$pick1][0]
              + rand() * ( $limits[$pick1][1] - $limits[$pick1][0] );

    -M

    Free your mind