The nice thing about the string x N approach is that picking is O(1), but as you say, it gets unwieldy and memory hungry when the sum of your integer weights gets larger than a few hundred. And if your weights share no common factor (for example, when two of them are distinct primes), you cannot even use scaling to reduce that consumption.
However, at much the same point as that approach becomes unwieldy, the scan & accumulate method starts to get expensive in time and CPU.
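For reference, here is a minimal sketch of the scan & accumulate approach being compared; the subroutine name pickLinear is mine, and the data is just the test data used further down:

use List::Util qw[ sum ];

## O(n) per pick: walk the weights, accumulating, until the running
## total passes a random value drawn from [0, total).
sub pickLinear {
    my @weights = @_;
    my $total   = sum @weights;
    my $rand    = rand $total;
    my $accum   = 0;
    for my $i ( 0 .. $#weights ) {
        $accum += $weights[ $i ];
        return $i if $rand < $accum;
    }
    return $#weights;    ## floating-point safety net
}

my @array = ( [ "foo", 1 ], [ "bar", 3 ], [ "baz", 4 ], [ "quux", 9 ] );
print $array[ pickLinear( map $_->[1], @array ) ][ 0 ], "\n";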
It seems a shame to give up the efficiency, and the code below demonstrates that you do not have to.
It works by creating a scaled array of indexes into the data, in similar fashion to the string x N method, but storing the accumulated cusp points at appropriate places within the array. For the example, I've chosen a scale of 100 and so I build a weights table that looks like this:
[ 0 0 0 0 0 5.88235294117647 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 23.5294117647059 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 47.0588235294118 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 100 ]
That is, a 100-element array (plus a final element which isn't used), with the values being the indexes into the original array except at the cusp points, where I store the accumulated probability. To make a pick, I choose a random number in the range 0 .. 99 and use this as an index into the table above.
If the value returned is an integer, it can be used as a direct index into the original array, giving an O(1) pick.
If the value returned is not an integer, then it is necessary to compare the random value chosen against the returned value (the accumulated probability) to decide whether the index above or below is used.
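To make the arithmetic concrete: with the test weights 1, 3, 4 and 9 (total 17) and a scale of 100, the cusps fall at 100/17 = 5.882..., 400/17 = 23.529... and 800/17 = 47.058..., which is exactly where the non-integer values sit in the table above. If rand happened to return 23.2, the entry at index 23 is the cusp value 23.529...; 23.2 lies below that, so the entry below (index 22, holding 1) is used and bar is picked. Had rand returned 23.9 instead, being above the cusp, the entry above (index 24, holding 2) would be used, giving baz.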
The size of the weights => index mapping array is fixed regardless of the input domain size, and it can be adjusted to minimise the occasions when the secondary decision is required. E.g. I've used a range of 100, and with four items there are only three cusp entries in the table, so the secondary decision is required in 3% of picks. If I'd used 1000, it would be 0.3% of cases.
For the test data provided (weights 1, 3, 4 and 9; total 17), the required probabilities are:
 foo:  5.88235294117647%
 bar: 17.6470588235294%
 baz: 23.5294117647059%
quux: 52.9411764705882%
Output from a few runs of the code below:
c:\test>junk -N=1e2
       foo frequency 4.000%
       bar frequency 18.000%
       baz frequency 23.000%
      quux frequency 55.000%

c:\test>junk -N=1e3
       foo frequency 5.300%
       bar frequency 15.400%
       baz frequency 25.600%
      quux frequency 53.700%

c:\test>junk -N=1e4
       foo frequency 5.640%
       bar frequency 17.690%
       baz frequency 23.370%
      quux frequency 53.300%

c:\test>junk -N=1e5
       foo frequency 5.961%
       bar frequency 17.379%
       baz frequency 23.683%
      quux frequency 52.977%

c:\test>junk -N=1e6
       foo frequency 5.885%
       bar frequency 17.662%
       baz frequency 23.446%
      quux frequency 53.007%
As usual, the more picks that are made, the closer the actual frequencies get to the theoretical probabilities.
The code
#! perl -slw
use strict;

use List::Util qw[ sum ];

our $N     ||= 1_000;   ## number of picks to make
our $SCALE ||= 100;     ## size of the weights => index mapping table

sub pickGen {
    my $total = sum @_;
    my $accum = 0;

    ## Build the scaled table: runs of each item's index, with the
    ## accumulated (scaled) weight stored at each cusp point.
    my @weights = map {
        my $fencepost = $_[ $_ ] * $SCALE / $total;
        ( ( $_ ) x $fencepost, $accum += $fencepost )
    } 0 .. $#_;

    return sub {
        my $randValue = rand $SCALE;
        my $index = $weights[ $randValue ];

        ## An integer entry is a direct index into the original array.
        return $index if $index == int $index;

        ## Otherwise we hit a cusp: compare the random value with the
        ## stored accumulated weight and step down or up accordingly.
        return $weights[ int( $randValue + ( $randValue <=> $index ) ) ];
    };
}

my @array = ( [ "foo", 1 ], [ "bar", 3 ], [ "baz", 4 ], [ "quux", 9 ] );

my $picker = pickGen( map $_->[1], @array );

my %choices;
$choices{ $array[ $picker->() ][ 0 ] }++ for 1 .. $N;

printf "%10s frequency %.3f%%\n", $_, $choices{ $_ } * 100 / $N
    for sort { $choices{ $a } <=> $choices{ $b } } keys %choices;
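Thanks to the -s switch on the shebang line, both knobs can be set from the command line, as in the runs above; e.g. something like junk -N=1e6 -SCALE=1000 should cut the secondary decisions to roughly 0.3% of picks, as discussed earlier.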
In reply to Re: Picking a random item through probability
by BrowserUk
in thread Picking a random item through probability
by muba