muba has asked for the wisdom of the Perl Monks concerning the following question:

Say I have a AoA that goes something like this:

(update: I said HoA, but I meant AoA, as pointed out below by NetWallah. Or have the code slightly altered as per GrandFather's suggestion. But that really is not the point :) I'll have a look into that discussion, rhesa.)

my @array = ( ["foo", 1], ["bar", 3], ["baz", 4], ["quux", 9] );
The numbers represent how likely it is that the string gets picked - the bigger the number, the more probable it is the string gets picked. I hope I express myself comprehendable here...

Well anyway... how would I do that? I could of course just create an array that contains each string N times, N being the number (qw(foo bar bar bar baz baz baz baz quux quux quux quux quux quux quux quux quux)) but the arrays would get rather large and this way would be quite slow.

And I guess there must be alternatives... any suggestions?

Replies are listed 'Best First'.
Re: Picking a random item through probability
by rhesa (Vicar) on Nov 23, 2006 at 21:04 UTC
Re: Picking a random item through probability
by NetWallah (Canon) on Nov 23, 2006 at 21:18 UTC
    In addition to rhesa's recommendation - Your "%hash" most likely does not look the way you expect it to - There are no usable Keys, and the structure does not make logial sense for what you say you need. Here is what your structure looks like:
    $VAR1 = { 'ARRAY(0x19333b0)' => [ 'baz', 5 ], 'ARRAY(0x350c8)' => [ 'bar', 7 ] };
    Note that it has converted your Arrayref ["foo",1] into a KEY, which is probably not what you want. In any case, a hash-type structure does not seem to be called for , based on your description. I'd sugges an AoA.
    Update: AoA Recommended:
    my @Index_Is_The_Weight=( [], # Zero weight - no items ["foo"], # Only one item has weight ONE [], # No items in weight 2 ["bar"], # Three ([]) x 5, # Five empty items ["quux"] );

         "A closed mouth gathers no feet." --Unknown

      Actully a hash does make sense if it has been used to accumulate usage statistics:

      ++$hash{$_} for @stuff;

      but, as you almost suggest, the hash should look like:

      my %hash = ( foo => 1, bar => 3, baz => 4, quux => 9, );

      DWIM is Perl's answer to Gödel

        Actually, even this is problematic. Because you need to go through the list in a defined order to ensure that each item has its respective probabilities. So you really do need the array refs ... but you need them in a list, not a hash.

        my @items = ( ["foo", 1], ["bar", 3], ["baz", 4], ["quux", 9] );
        Then, when you generate a number, you first loop through summing $_->[1]:
        use List::Util; my $total_weight = List::Util::sum(map { $_->[1] } @items);
        and then you generate a number from that, and then find the item in the total:
        my $pick = int(rand($total_weight)); my $picked_item = (List::Util::first { $pick -= $_->[1]; $pick < 0 } + @items)->[0];
        Ok, that's a bit convoluted, but I don't quite want to do someone's homework ;-) Anyway, the point is that if, when iterating over the hash, you end up with the hash being reordered from lookup to lookup, I don't think you'll get precisely the correct distribution. (Of course, I'm assuming a perfectly random rand function, but that is a separate problem.)

        Update: fix some minor formatting, and off-by-one error (was $pick <= 0)

Re: Picking a random item through probability
by demerphq (Chancellor) on Nov 24, 2006 at 00:22 UTC
    my @array = ( ["foo", 1], ["bar", 3], ["baz", 4], ["quux", 9] ); @array = sort { $a->[1] <=> $b->[1] || $a->[0] cmp $b->[0]} @array; my $sum = 0; $_->[2] = ($sum += $_->[1]) for @array; my $v = int(rand $sum); my $idx = 0; $idx++ while ( $v > $array[$idx][2] ); print $array[$idx][0],"\n";

    If the list was really large I'd think about binsearching the results. And if I didn't care about memory (often there is no need to care), then I'd just duplicate the elements into a list and then use a random index into the list. For instance if the list is small there is nothing really wrong with doing the easy thing...

    my @list = map { ($_->[0]) x $_->[1] } @array; my $elem = $list[rand @list];

    (RAM is cheap.)

    ---
    $world=~s/war/peace/g

Re: Picking a random item through probability
by swampyankee (Parson) on Nov 24, 2006 at 02:09 UTC

    I actually did something like this. I had an array,

    my @prob = (0.1, 0.25, 0.03, 0.4, 0.22);
    which described the probability of one of five (in this contrived case) options.

    my @options = qw(monday tuesday wednesday thursday friday);

    My suggestion is something like this:

    #!/usr/bin/perl use strict; use warnings; my @likelihood = (1,3,4,9); my @option = qw(foo bar baz quux); my @sum; $sum[0] = $likelihood[0]; foreach my $n ( 1 .. $#likelihood){ $sum[$n] = $sum[$n - 1] + $likelihood[$n]; } # @sum = (1, 4, 8, 17); my $n = int(rand($sum[-1])+0.1); # random number <= 17 foreach my $i ( 0 .. $#sum) { if($n <= $sum[$i]){ print $option[$i] . "\n";; last; } }

    NOTE:  This code is not tested.

    emc

    At that time [1909] the chief engineer was almost always the chief test pilot as well. That had the fortunate result of eliminating poor engineering early in aviation.

    —Igor Sikorsky, reported in AOPA Pilot magazine February 2003.

      I've done this before, too, and I'd like to point out that the addition of 0.1 and the check of <= will throw off your probabilities. It may look right, and may even suffice for what you're doing, but it won't be the precise probabilities that you fed in at the top.

      Also, I'd like to discourage the idea of separating the weights from the options. It's far, far too easy to get the quantities out of sync. Perl makes anon hashes and arrays so easy that there's no excuse to do this.

      my @likelihood = (1, 3, 4, 9, 2, 8); my @option = qw(foo bar baz quux biz);
      Oops! Maybe if we lined it up better ...
      my @likelihood = (1, 3, 4, 9, 2, 8); my @option = qw(foo bar baz quux biz);
      Now it's obvious. But what if we have 50 items? It'll scroll off the right side of the screen and be practically impossible to deal with. Better to use an AoA or AoH instead.
      my @options = ( { name => 'foo', weight => 1 }, { name => 'bar', weight => 3 }, { name => 'baz', weight => 4 }, { name => 'quux', weight => 9 }, { name => 'biz', weight => 2 }, );
      Back to your probabilities ... remembering that rand(n) gives you a random number such that 0 < rand(n) < n, theoretically with even distribution, you're getting a random number r that is 0.1 < r < n + 0.1, then truncating it. That gives you a 0.9/17 (or about 5.3%) chance of getting 0, plus a 1/17 chance (or about 5.9%) of getting a 1, both of which (total of about 11.2%) will get you 'foo' in your case. Meanwhile, 'bar', with a weight of 3, has a 3/17 chance, or 17.6%, of being selected, which does not seem to be a triple weight compared to 'foo' as one would expect. Finally, there is only a 0.1/17 chance (or about 0.6%) of getting 17, plus the other 8 numbers that get you your 'quux', which is much less than 9/17 chance.

      Best, instead, to stick to int(rand($total_weight)) and comparing < instead of <=.

        Since today is a US holiday, and I did this for work, I've not got my source code available. Even if I could remember the exact code, it's not mine to put into a public place.

        The potential values and their corresponding probabilities (the sum of which was, of course, 1) were read from an output from some SAS analyses; neither was coded into the Perl program, so alignment wasn't an issue, and there were never more than about a half-dozen possible values.

        emc

        At that time [1909] the chief engineer was almost always the chief test pilot as well. That had the fortunate result of eliminating poor engineering early in aviation.

        —Igor Sikorsky, reported in AOPA Pilot magazine February 2003.
Re: Picking a random item through probability
by BrowserUk (Patriarch) on Nov 24, 2006 at 17:52 UTC

    The nice thing about the string x N approach is that picking is O(1), but as you say, it gets unweildy and memory hungry when the sum of your integer weights get larger than a few hundred. If one or more of your weights is prime then you cannot even use scaling to reduce that consumption.

    However, at exactly the same point as it starts to become unweildy, the scan & accumulate method starts to get expensive of time/cpu.

    It seems a shame to give up the efficiency, and the code below demonstrates that you do not have to.

    It works by creating a scaled array of indexes into the data, in similar fashion to the string x N method, but storing the accumulated cusp points at appropriate places within the array. For the example, I've chosen a scale of 100 and so I build a weights table that looks like this:

    [ 0 0 0 0 0 5.88235294117647 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 23.5294117647059 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 47.0588235294118 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 100 ]

    That is, a 100 element array (+ the last one which isn't used), with the values being the indexes into the original array except at the cusp points, where I store the accumulated probability. To make a pick, I chose a random number between 0 .. 99. and use this as an index into the table above.

    If the value returned is an integer, it can be used as a direct index into the original array, giving an O(1) pick.

    If the value returned is not an integer, then it is necessary to perform a comparison between the random value chosen and the returned value (probability) to chose whether the index above or below is used.

    • If rand( 100 ) == 0, 1, 2, 3 or 4 then return weights[ $rand ].
    • If rand( 100 ) == 5, then compare rand( 100 ) <=> weights[ rand( 100 ) ] and use the index from the weight above or below as appropriate.
    • And so on for the other possible values.

    The size of the weights => index mapping array is fixed regardless of the input domain size, and it can be adjusted to minimise the occasions when the secondary decision is required. Eg. I've used a range of 100, so the secondary decision is required in 3% of picks. If I'd used 1000, then it would be 0.3% of cases.

    For the test data provided, the required probablilities are:

    foo: 5.88235294117647% bar: 17.6470588235294% baz: 23.5294117647059% quux: 52.9411764705882%

    Output from a few runs of the code below:

    c:\test>junk -N=1e2 foo frequency 4.000% bar frequency 18.000% baz frequency 23.000% quux frequency 55.000% c:\test>junk -N=1e3 foo frequency 5.300% bar frequency 15.400% baz frequency 25.600% quux frequency 53.700% c:\test>junk -N=1e4 foo frequency 5.640% bar frequency 17.690% baz frequency 23.370% quux frequency 53.300% c:\test>junk -N=1e5 foo frequency 5.961% bar frequency 17.379% baz frequency 23.683% quux frequency 52.977% c:\test>junk -N=1e6 foo frequency 5.885% bar frequency 17.662% baz frequency 23.446% quux frequency 53.007%

    As normal, the more picks made, the closer to the theoretical probabilities the actual frequencies get.

    The code

    #! perl -slw use strict; use List::Util qw[ sum ]; our $N ||= 1_000; our $SCALE ||= 100; sub pickGen { my $total = sum @_; my $accum = 0; my @weights = map{ my $fencepost = $_[ $_ ] * $SCALE / $total; ( ( $_ ) x $fencepost, $accum += $fencepost ) } 0 .. $#_; return sub { my $randValue = rand $SCALE; my $index = $weights[ $randValue ]; return $index if $index == int $index; return $weights[ int( $randValue + ( $randValue <=> $index ) ) + ] }; } my @array = ( ["foo", 1], ["bar", 3], ["baz", 4], ["quux", 9] ); my $picker = pickGen( map $_->[1], @array ); my %choices; $choices{ $array[ $picker->() ][ 0 ] }++ for 1 .. $N; printf "%10s frequency %.3f%%\n", $_, $choices{ $_ } *100 / $N for sort{ $choices{ $a } <=> $choices{ $b } } keys %choices;

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.