comment on

The nice thing about the string x N approach is that picking is O(1), but as you say, it gets unweildy and memory hungry when the sum of your integer weights get larger than a few hundred. If one or more of your weights is prime then you cannot even use scaling to reduce that consumption.

However, at exactly the same point as it starts to become unweildy, the scan & accumulate method starts to get expensive of time/cpu.

It seems a shame to give up the efficiency, and the code below demonstrates that you do not have to.

It works by creating a scaled array of indexes into the data, in similar fashion to the string x N method, but storing the accumulated cusp points at appropriate places within the array. For the example, I've chosen a scale of 100 and so I build a weights table that looks like this:

[
    0 0 0 0 0 5.88235294117647 
    1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 23.5294117647059 
    2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 47.0588235294118 
    3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 
    3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 100
]
[download]

That is, a 100 element array (+ the last one which isn't used), with the values being the indexes into the original array except at the cusp points, where I store the accumulated probability. To make a pick, I chose a random number between 0 .. 99. and use this as an index into the table above.

If the value returned is an integer, it can be used as a direct index into the original array, giving an O(1) pick.

If the value returned is not an integer, then it is necessary to perform a comparison between the random value chosen and the returned value (probability) to chose whether the index above or below is used.

If rand( 100 ) == 0, 1, 2, 3 or 4 then return weights[ $rand ].
If rand( 100 ) == 5, then compare rand( 100 ) <=> weights[ rand( 100 ) ] and use the index from the weight above or below as appropriate.
And so on for the other possible values.

The size of the weights => index mapping array is fixed regardless of the input domain size, and it can be adjusted to minimise the occasions when the secondary decision is required. Eg. I've used a range of 100, so the secondary decision is required in 3% of picks. If I'd used 1000, then it would be 0.3% of cases.

For the test data provided, the required probablilities are:

  foo: 5.88235294117647%
  bar: 17.6470588235294%
  baz: 23.5294117647059%
 quux: 52.9411764705882%
[download]

Output from a few runs of the code below:

c:\test>junk -N=1e2
       foo frequency 4.000%
       bar frequency 18.000%
       baz frequency 23.000%
      quux frequency 55.000%

c:\test>junk -N=1e3
       foo frequency 5.300%
       bar frequency 15.400%
       baz frequency 25.600%
      quux frequency 53.700%

c:\test>junk -N=1e4
       foo frequency 5.640%
       bar frequency 17.690%
       baz frequency 23.370%
      quux frequency 53.300%

c:\test>junk -N=1e5
       foo frequency 5.961%
       bar frequency 17.379%
       baz frequency 23.683%
      quux frequency 52.977%

c:\test>junk -N=1e6
       foo frequency 5.885%
       bar frequency 17.662%
       baz frequency 23.446%
      quux frequency 53.007%
[download]

As normal, the more picks made, the closer to the theoretical probabilities the actual frequencies get.

The code

#! perl -slw
use strict;
use List::Util qw[ sum ];

our $N ||= 1_000;
our $SCALE ||= 100;

sub pickGen {
    my $total = sum @_;
    my $accum = 0;
    my @weights = map{
        my $fencepost = $_[ $_ ] * $SCALE / $total;
        ( ( $_ ) x $fencepost, $accum += $fencepost )
    } 0 .. $#_;
    return sub {
        my $randValue = rand $SCALE;
        my $index = $weights[ $randValue ];
        return $index if $index == int $index;
        return $weights[ int( $randValue + ( $randValue <=> $index ) )
+ ]
    };
}

my @array = (
    ["foo",  1],
    ["bar",  3],
    ["baz",  4],
    ["quux", 9]
);

my $picker = pickGen( map $_->[1], @array );

my %choices;
$choices{ $array[ $picker->() ][ 0 ] }++ for 1 .. $N;

printf "%10s frequency %.3f%%\n", $_, $choices{ $_ } *100 / $N
    for sort{ $choices{ $a } <=> $choices{ $b } } keys %choices;
[download]

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Picking a random item through probability by BrowserUk
in thread Picking a random item through probability by muba

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.