in reply to Re: Weighted random selection
in thread Weighted random selection

Isn't this pretty much what I had in the first place?

Re^3: Weighted random selection
by BrowserUk (Patriarch) on Jul 29, 2008 at 00:57 UTC
    Isn't this pretty much what I had in the first place?

    Having re-inspected your code, yes. Kinda. I didn't recognise it as such the first time I looked.

    That does raise a question though. Why do you think it is inefficient?

    The cost of building @choices (your @rndselect) is a one-off, up-front cost. And hardly onerous at that, unless you have thousands or millions of choices.

    The actual runtime cost, $choices[ rand @choices ], is as minimal as it is possible to be, and totally fair: one random number generation and one access to the array.
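
    For reference, a minimal sketch of that approach (the item and weight names here are illustrative, not taken from your code):

        #! perl -lw
        use strict;

        my @items   = qw[ A B C ];
        my @weights = ( 1, 2, 5 );      ## integer weights

        ## One-off, up-front cost: repeat each index according to its weight
        my @choices = map { ( $_ ) x $weights[ $_ ] } 0 .. $#weights;

        ## Per-pick cost: one random number, one array access
        print $items[ $choices[ rand @choices ] ] for 1 .. 10;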

    The only real alternative to it (for fairness) is the Knuth algorithm, which from memory requires you to:

    1. Pick a random number;
    2. Traverse (on average) half the weights array (or hash), accumulating the weights until the sum is greater than the random number you picked;
    3. Then obtain the number (either from a parallel array or the hash).

    So O(N/2) compared to O(1).
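
    For comparison, a rough sketch of that linear scan (untested; the names are mine, not Knuth's):

        ## Returns an index, chosen with probability proportional to its weight
        sub pickLinear {
            my( $weights ) = @_;            ## arrayref of (possibly non-integer) weights
            my $total = 0;
            $total += $_ for @$weights;

            my $r   = rand $total;          ## 1. pick a random number in [0, $total)
            my $sum = 0;
            for my $i ( 0 .. $#$weights ) { ## 2. accumulate weights until the sum exceeds it
                $sum += $weights->[ $i ];
                return $i if $r < $sum;     ## 3. return the corresponding index
            }
            return $#$weights;              ## guard against floating-point rounding
        }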

    If you have sufficient choices that the memory for the array is a concern, then pack indexes into a string and cut your memory requirement to 1/8th.
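
    Something along these lines (one byte per entry, so it assumes no more than 256 distinct choices and, like @choices above, integer weights):

        ## Build the expanded table as a packed string instead of an array
        my $pickStick = pack 'C*', map { ( $_ ) x $weights[ $_ ] } 0 .. $#weights;

        ## Per-pick cost is still O(1): one rand, one substr, one ord
        my $index = ord substr $pickStick, rand length $pickStick, 1;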


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      The cost of building @choices (your @rndselect) is a one-off, up-front cost. And hardly onerous at that, unless you have thousands or millions of choices.

      And impossible to do well if your weights are sqrt(2), sqrt(5) and sqrt(11).

      [Knuth vs BrowserUK] So O(N/2) compared to O(1).

      Not quite a fair comparison, is it? You give yourself a preprocessing complexity of Ω(R) (in both time and memory), where R is the sum of all weights (and could hence be much larger than N), while you don't give Knuth any.

        And impossible to do well if your weights are sqrt(2), sqrt(5) and sqrt(11).

        Not at all. It just requires a little arithmetic that even I'm capable of:

        #! perl -slw
        use strict;
        use Data::Dump qw[ pp ];

        our $N ||= 1e4;

        sub pickGen {
            my $scale = shift;
            my $ref = @_ > 1 ? \@_ : $_[ 0 ];
            my $total = 0;
            $total += $_ for @$ref;
            my $pickStick = pack 'C*',
                map{ ($_) x ( $scale * $ref->[ $_ ] * 10 / $total ) } 0 .. $#$ref;
            return sub {
                return ord substr $pickStick, rand length $pickStick, 1;
            };
        }

        my @items = 'A' .. 'C';

        ## Your "difficult" weights:
        my @weights = map sqrt $_, 2, 5, 11;

        ## Only required for testing
        my $total = 0;
        $total += $_ for @weights;
        print "Required weights(%):\n\t",
            join "\t", map $_ * 100 / $total, @weights;

        for my $scale ( 1, 10, 100 ) {
            ## Generate a picker (cost amortized over $N uses)
            my $picker = pickGen $scale, @weights;

            ## use it $N times
            my %stats;
            $stats{ $items[ $picker->() ] }++ for 1 .. $N;

            ## Check the results
            print "\nResults for $N picks (scale:$scale)\n\t",
                join "\t\t\t", map $stats{ $_ } * 100 / $N, @items;
        }

        Produces these results over 1000 and 1 million picks.

        c:\test>weightedPick.pl -N=1e3
        Required weights(%):
                20.2990178902944        32.0955653989182        47.6054167107874

        Results for 1e3 picks (scale:1)
                21.5    34      44.5

        Results for 1e3 picks (scale:10)
                19.4    32.5    48.1

        Results for 1e3 picks (scale:100)
                19.5    29.5    51

        c:\test>weightedPick.pl -N=1e6
        Required weights(%):
                20.2990178902944        32.0955653989182        47.6054167107874

        Results for 1e6 picks (scale:1)
                22.2248 33.3686 44.4066

        Results for 1e6 picks (scale:10)
                20.2238 32.3209 47.4553

        Results for 1e6 picks (scale:100)
                20.1913 32.0977 47.711

        By all means do more thorough testing if you think it is warranted.

        Knuth vs BrowserUK

        Hardly my algorithm since the OP was already using it. I might have reinvented it for myself, but it's not rocket science.

        For Knuth's algorithm, there is no pre-processing needed (nor any that can be done).

        As for the pre-processing involved in my(*) algorithm: given that the second test above does 3 pre-processing passes and 3 million picks in under 4 seconds, there is no point worrying about efficiency unless you are doing millions of picks, over which the pre-processing cost will be amortised. And once so amortised, the cost (with R < the sum of the weights, due to scaling) is fractional.

        Even in the OP's scenario, where he is perhaps only making one pick per (cgi) run--and if so, what does the efficiency of one pick mean relative to the start-up and network costs anyway?--there is no need to do the pre-processing for every run. It only needs to be done when the weights change.

        The item list & weights must be loaded from somewhere--e.g. a file or DB--and the pre-processing could be...um, well...pre-processed, with the items read in as a pre-expanded list or string. So the pre-processing again gets amortised over whatever number of picks occur between weight changes.
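
        For instance, one (purely hypothetical) arrangement would be to write the packed string out whenever the weights change, and have each run just slurp it back ('pickstick.bin' is a made-up name; $pickStick and @items as in the code above):

            ## When the weights change: save the pre-expanded pick string
            open my $out, '>:raw', 'pickstick.bin' or die $!;
            print { $out } $pickStick;
            close $out;

            ## In each (cgi) run: slurp it back and pick
            open my $in, '<:raw', 'pickstick.bin' or die $!;
            my $stick = do { local $/; <$in> };
            close $in;

            my $pick = $items[ ord substr $stick, rand length $stick, 1 ];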

        Either way you slice it, if efficiency is really a consideration, then an efficient implementation of my(*) algorithm will beat Knuth's (for this use).

        (*) That's very tongue in cheek!

