in reply to Need technique for generating constrained random data sets

my @constraints = ( {mid => 20, sd => 15}, {mid => 30, sd => 25}, {mid => 50, sd => 10}, );

As others have implied, I don't think you're going to be able to do this deterministically, as you have over-specified your solution set. The sum of your maximums is greater than 100 and the sum of your minimums is less than 100.

What comes to mind to minimize the number of discarded sets and avoid scaling is to pick the numbers in descending order of midpoint. After generating each number, check if the remainder is more than the sum of the minimums remaining and less than the sum of the maximums remaining. If it's outside the bounds, repick the last number. When you're down to the last number, you don't generate it randomly, it's just the remainder.

Basically, you're pruning off choices that can't satisfy the remaining constraints.

Without thinking it through further, I'm worried that doing it in descending order of midpoint might bias the results. I think you could pick them in random order and you're just more likely to have to repick numbers along the way.

Update: Here's a code sample:

use strict; use warnings; sub RandFlat { #Return a rand () value with a flat distribution about the $mean + +- $stdDev my ($mean, $stdDev) = @_; my $range = 2.0 * $stdDev; my $value = rand ($range); return $value + ($mean-$stdDev); } sub generate_set { my ($remainder, @constraints) = @_; my @results; my ($sum_of_minima, $sum_of_maxima); for my $c ( @constraints ) { $sum_of_minima += $c->{mid} - $c->{sd}; $sum_of_maxima += $c->{mid} + $c->{sd}; } # iterate through N-1 constraints in descending order my @descending = sort { $b->{mid} <=> $a->{mid} } @constraints; my $last_value = pop @descending; for my $c ( @descending ) { my $n = RandFlat( $c->{mid}, $c->{sd} ); # repeat if remainder outside sum of the remaining # minima and maxima contraints my $new_remainder = $remainder - $n; my $new_sum_of_minima = $sum_of_minima - ( $c->{mid} - $c->{sd +} ); my $new_sum_of_maxima = $sum_of_maxima - ( $c->{mid} + $c->{sd +} ); redo if ( $new_remainder < $new_sum_of_minima) || ( $new_remainder > $new_sum_of_maxima); # otherwise save number and update the remainder and constrain +ts push @results, [ $c->{mid}, $c->{sd}, $n ]; $remainder = $new_remainder; $sum_of_minima = $new_sum_of_minima; $sum_of_maxima = $new_sum_of_maxima; } # the remainder must now satisfy the final constraint return @results, [ $last_value->{mid}, $last_value->{sd}, $remaind +er ]; } my $total = 100; my @constraints = ( {mid => 20, sd => 15}, {mid => 30, sd => 25}, {mid => 50, sd => 10}, ); for ( 1 .. 5 ) { for my $result ( generate_set($total, @constraints) ) { my ($mid, $sd, $value) = @$result; printf "%5.1f +-%5.1f: %5.1f", $mid, $sd, $value; print " - bad" if ($value < ($mid - $sd)) || ($value > ($mid + + $sd)); print "\n"; } print "\n"; }

I ran it 1000 times and didn't see any bad results.

Also, thinking it through again, I don't think the descending order will wind up biased (but I could be convinced by a good argument).

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Replies are listed 'Best First'.
Re^2: Need technique for generating constrained random data sets
by GrandFather (Saint) on Feb 08, 2007 at 10:10 UTC

    Some such approach as this had occured to me already, but I couldn't get my head around implementing it. The comment regarding the "over-specified solution set" is helpfull in my realising why I'm having trouble with this.

    It's not clear to me that your code would be happy with a set of constraints such as:

    my @constraints = ( {mid => 50, sd => 50}, {mid => 50, sd => 50}, {mid => 50, sd => 50}, );

    See Re^2: Need technique for generating constrained random data sets for some of the background.


    DWIM is Perl's answer to Gödel

      I just tried it and it works fine:

      50.0 +- 50.0: 43.1 50.0 +- 50.0: 18.8 50.0 +- 50.0: 38.2 50.0 +- 50.0: 32.8 50.0 +- 50.0: 13.4 50.0 +- 50.0: 53.8 50.0 +- 50.0: 67.6 50.0 +- 50.0: 25.2 50.0 +- 50.0: 7.2 50.0 +- 50.0: 76.4 50.0 +- 50.0: 10.9 50.0 +- 50.0: 12.7 50.0 +- 50.0: 89.9 50.0 +- 50.0: 3.1 50.0 +- 50.0: 7.0

      There is risk of getting trapped and not converging on an answer. E.g. if the first number is 99, then the second number will still be randomly generated between 0 and 100 and will be thrown out repeatedly unless it's less than 1. Ditto for the third number but in an even tighter range.

      However, I think that can be avoided by constraining each random number to be in the range from (mid - sd) to min(mid + sd, remainder). In the example above, the second number would be chosen between 0 and 1. Assuming that was 0.6, then the third number would be chosen between 0 and 0.4. That would be guaranteed to converge quickly.

      -xdg

      Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

        Getting trapped was my concern. Now you just need to convince me that the additional constraint on later calculated values doesn't bias the result. I realise that where constraints are the same you can randomise the value calculation order to distribute the bias among the values from set to set. I presume however bias would still be a problem if the constraints were similar, but different? If randomising the value calculation order were used, would that generate bias in the generated sets?

        It may be that I need to change technique depending on the disposition of the constraints. Use the original technique where the probability of choosing a good set randomly is high and switch to this technique (or kyle's code?) for troublesome constraint sets.


        DWIM is Perl's answer to Gödel