in reply to Re^3: Need technique for generating constrained random data sets
in thread Need technique for generating constrained random data sets

I'm not really sure that there is a clear sense of what "fairness" means here. These variable are negatively correlated -- if one if them is high in its range, the others have to be lower in their ranges so it all sums to 100.

For example, when the sum of midpoints exceeds 100, there will be more combinations that work when variables are below their midpoints.

Put differently, one definition of fairness is that all potential combinations are equally likely. Another definition would take into account the implications of the overlapping ranges.

Our algorithms perform quite differently as a result. Here are the means for each variable from our two algorithms:

# Original constraints (xdg algorithm) Expected: 20.0 30.0 50.0 Means 19.5 30.5 50.0 # Pathological constraints (xdg algorithm) Expected: 50.0 50.0 50.0 Means 49.5 24.8 25.7 # Original constraints (gen3 algorithm) Expected: 20.0 30.0 50.0 Means 22.6 23.2 54.2 # Pathological constraints (gen3 algorithm) Expected: 50.0 50.0 50.0 Means: 33.5 33.4 33.1

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Replies are listed 'Best First'.
Re^5: Need technique for generating constrained random data sets
by BrowserUk (Patriarch) on Feb 08, 2007 at 18:57 UTC

    Using the example of two 6-sided dice. If we set a limit of 10 for the sum, then some values will be rejected. Which I think is analogous to this problem. Accumulating the frequency for each value by simply rejecting those over our limit still gives a flat distribution:

    #! perl -slw use strict; use List::Util qw[ sum ]; my %stats; for( 1 .. $ARGV[ 0 ] || 1e3 ) { my @dice = map 1 + int( rand 6 ), 1 .. 2; next if sum( @dice ) > 10; ++$stats{ "@dice" }; } my $mean = sum( values %stats ) / scalar values %stats; printf "%12s %d dev( %d )\n", $_, $stats{ $_ }, abs( $stats{ $_ } - $mean ) for map unpack( 'x[C2] A*', $_ ), sort map pack( 'C2 A*', ( reverse split( ' ', $_ ) ), $_ ), keys %stats; __END__ C:\test>junk7 1e6 1 1 27534 dev( 247 ) 2 1 27640 dev( 141 ) 3 1 27742 dev( 39 ) 4 1 27878 dev( 96 ) 5 1 28161 dev( 379 ) 6 1 27556 dev( 225 ) 1 2 28018 dev( 236 ) 2 2 27426 dev( 355 ) 3 2 27650 dev( 131 ) 4 2 28099 dev( 317 ) 5 2 27949 dev( 167 ) 6 2 27607 dev( 174 ) 1 3 27752 dev( 29 ) 2 3 28033 dev( 251 ) 3 3 27767 dev( 14 ) 4 3 27730 dev( 51 ) 5 3 27802 dev( 20 ) 6 3 27760 dev( 21 ) 1 4 27791 dev( 9 ) 2 4 28065 dev( 283 ) 3 4 27894 dev( 112 ) 4 4 27487 dev( 294 ) 5 4 27778 dev( 3 ) 6 4 27935 dev( 153 ) 1 5 27708 dev( 73 ) 2 5 27668 dev( 113 ) 3 5 27854 dev( 72 ) 4 5 27810 dev( 28 ) 5 5 27654 dev( 127 ) 1 6 27600 dev( 181 ) 2 6 27700 dev( 81 ) 3 6 28029 dev( 247 ) 4 6 27717 dev( 64 )

    Whether a flat distribution is correct for this application is GrandFather's call. But even if some other distribution (poisson or whatever), is required, it would be much easier to arrive at that once you have a generator that produces a known distribution. I'm also unsure how you extend those other distributions to 4 or more dimensions?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      (Grrr. Lost a response. Shorter version follows)

      I think this is a great analogy for illustrating the problem. However, consider the case where the sum must equal ten. There are only three options (4/6, 5/5, 6/4) and they occur with equal probability. However, we only see 4, 5 and 6, whereas the assumption was that the inputs could be evenly distributed between 1 and 6.

      Does that matter? That's a question that requires the original problem context -- as you pointed out.

      -xdg

      Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

        The original problem boils down to randomly sampling a 7 dimensional search space with a small amount of information that suggests some corners don't need to be looked in. The space is presumed to be too big and possibly chaotic to be amenable to being searched using a grid or gradient type search for example, so the adopted technique used throws a bunch of darts and hopes some will hit interesting mountains.

        For example it may be known that shell fish can not comprise more than 70% of a diet because that causes death in short order (value invented BTW, effect real). But there is no information to say what the probability function is so a flat distribution is used for the search.


        DWIM is Perl's answer to Gödel