Efficiently selecting a random, weighted element

by jimt (Chaplain)
on Oct 10, 2006 at 15:53 UTC

This problem was originally presented to me by a co-worker as follows:

I have a set of 100 files, and I want to randomly choose 5 of them. However, I want to weight the selection of each file based upon the number of words in the file. More words == greater chance it will be randomly selected. The files could contain between 100 and 3,000 words. What's a good way to do this?

First of all, the approach I'm going to detail here is specific to this example, but can easily be adapted to any random selection of weighted values. It should scale very easily to very large data sets (number of files, in this case), with very large weighting information (number of words, in this case). This post is mostly pseudo-code and explanation, not a functional piece of code. This solution has probably been created by other people before.

Okay, for sake of example, we're going to start off with 5 files.

my @files = qw(file_a.txt file_b.txt file_c.txt file_d.txt file_e.txt);

And we want to randomly choose 2 of the files. But, we want to weight our selections based off of the number of words in the file. More words in a file == more likely the file will be chosen.

Your first thought might be to build an array of all of the words in all of the files, then pick a random index, and determine which file the word is in. Note - you would need to pre-cache which word at which index is associated with which file. For example, the word "the" could appear in both file_a.txt and file_b.txt. So you can't just randomly choose index 3, see "the" there, and know which file it's in. You have to know that index 3 => the => file_a.txt.

This is the first optimization. The words don't matter, only which file they're in. So instead of storing the word at each point, you can just store the file name. At this point, you'll need to be able to count the number of words in each file to get your weighting information. This is left as an exercise to the reader - use your favorite word counting widget. For example's sake, we'll say you end up with this structure:

my %words_in_files = (
    'file_a.txt' => 10,
    'file_b.txt' => 1,
    'file_c.txt' => 3,
    'file_d.txt' => 5,
    'file_e.txt' => 10,
);

Now you can build up an array where the first 10 elements are "file_a.txt", the next one is "file_b.txt" and so on. For simplicity's sake, we'll display each file as its trailing letter ("file_a.txt" becomes "a"). This way, we can see our data:

aaaaaaaaaabcccdddddeeeeeeeeee
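
A minimal sketch of this brute-force version, using the %words_in_files hash above (added here for illustration, not part of the original write-up):

my @flat;
for my $file ( keys %words_in_files ) {
    push @flat, ($file) x $words_in_files{$file};    # repeat the name once per word
}

my $choice = $flat[ int rand @flat ];                # one weighted-random pick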

This approach is fully functional, but doesn't scale well. In our original problem, we had 100 files, with up to 3,000 words each. This is potentially a 300,000 element array, and it's just gonna get bigger if you add more files or words within the files. Don't get me wrong - perl can do it, but there's a better way.

The key is to realize that most of the information in that array is redundant. We're storing "a" 10 times. Do we really need to? Instead, we'll build a different data structure. In this structure, we'll store the file name, and the index at which the file begins. Externally, we'll also store a count of the total number of words. We end up with a structure along these lines:

my @indexes_of_files = (
    # terminology: index 0 == "file offset", index 1 == "file name"
    [ qw( 0  file_a.txt ) ],
    [ qw( 10 file_b.txt ) ],
    [ qw( 11 file_c.txt ) ],
    [ qw( 14 file_d.txt ) ],
    [ qw( 19 file_e.txt ) ],
);

my $total_number_of_words = 29;

Feel free to use hashrefs instead of arrayrefs, they may be easier to work with. I used arrayrefs here for simplicity of code display in the example. Note that the order of the files in this data structure is arbitrary. Whatever order the files are assigned in this array is irrelevant, so long as their file offsets change as appropriate.
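
For illustration, a minimal sketch of building this structure from %words_in_files in one pass (one possible way, not the only one):

my @indexes_of_files;
my $total_number_of_words = 0;
for my $file ( sort keys %words_in_files ) {    # any consistent order will do
    push @indexes_of_files, [ $total_number_of_words, $file ];
    $total_number_of_words += $words_in_files{$file};
}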

We now have a much more compact data structure. Our algorithm is easy - generate a random integer between 0 and $total_number_of_words - 1 (0..28, in this case). Let's say that we generated "15".

Next, you need to search through the @indexes_of_files array to find the element with the greatest file offset that's less than or equal to our generated number. Since the file offsets are in sorted order, a binary search can zip through the data in no time. Implementing the binary search (or whatever) algorithm is another exercise left to the reader.
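
For illustration, here is a minimal binary-search sketch over @indexes_of_files (it assumes the offsets are in ascending order):

# Return the index of the element with the greatest file offset <= $target.
sub find_index {
    my ( $indexes, $target ) = @_;
    my ( $lo, $hi ) = ( 0, $#$indexes );
    while ( $lo < $hi ) {
        my $mid = int( ( $lo + $hi + 1 ) / 2 );    # round up so $lo always advances
        if ( $indexes->[$mid][0] <= $target ) { $lo = $mid }
        else                                  { $hi = $mid - 1 }
    }
    return $lo;
}

# find_index( \@indexes_of_files, 15 ) returns 3, i.e. file_d.txt.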

However you find your data, you'll discover that you're looking at array index [3] (the fourth element), which has file offset 14, corresponding to "file_d.txt". Note that if you count off 15 ticks into the "aaaa...b...cc..." array drawn out above, you'll also reach a "d", corresponding to file_d.txt.

You have now successfully chosen your first file, so you need to set up for subsequent ones. This is a 3 step op. One is easy, one is expensive, and one is tedious. First, the easy step.

Subtract from $total_number_of_words the length of the file you just chose. In this case, file_d.txt has a length of 5 words, so $total_number_of_words becomes 24. You can calculate this length from the difference between the chosen element's file offset and the next element's (or the total, if it was the last element), you can cache it in the data structure, or you can look it up in the hash created earlier. Dealer's choice. But you need the length.

The "expensive" operation is simply to splice out the element at index [4].

Finally, for every element that came after the one you removed, subtract the length of the file just removed (5, in this case) from its file offset. You'll end up with this data structure when you're done:

my @indexes_of_files = (
    # terminology: index 0 == "file offset", index 1 == "file name"
    [ qw( 0  file_a.txt ) ],
    [ qw( 10 file_b.txt ) ],
    [ qw( 11 file_c.txt ) ],
    # [ qw( 14 file_d.txt ) ],  # THIS FILE WAS REMOVED
    [ qw( 14 file_e.txt ) ],    # this file offset was 19, is now 14
);

my $total_number_of_words = 24; # previously 29
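
A minimal sketch of those three steps (for illustration; $chosen is the index found by the search, 3 for file_d.txt here):

my $chosen = 3;

# The length of the chosen file: next offset minus this offset,
# or the remainder of the total if it was the last element.
my $removed_length = ( $chosen < $#indexes_of_files )
    ? $indexes_of_files[ $chosen + 1 ][0] - $indexes_of_files[$chosen][0]
    : $total_number_of_words - $indexes_of_files[$chosen][0];

$total_number_of_words -= $removed_length;    # the easy step
splice @indexes_of_files, $chosen, 1;         # the expensive step
$_->[0] -= $removed_length                    # the tedious step
    for @indexes_of_files[ $chosen .. $#indexes_of_files ];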

And blam-o, you're set up to choose your next file, it's as if file_d.txt never existed, and you can repeat ad infinitum until you've selected enough files out.

Considerations

  • With tremendously long lists of data (for this example, say thousands or millions of files), you need to do a splice on a large array, and then run through all higher elements to do a subtraction. You're just iterating over an array and doing a subtraction on an integer, but it's still O(n). There may be fancier ways to do this w/o splicing or changing offsets, but I was unable to come up with an elegant one. They all seemed fragile and complicated relative to just decrementing the indexes. YMMV.
  • This can be applied to any set of data that you need to randomly select a value from based on a weighting value.
  • For smaller data sets, it may be simpler to just use the "aaa...b...cc..." approach of an array that lists all the file names. But it doesn't scale as well.
  • There may be something that does this efficiently already on CPAN.

Replies are listed 'Best First'.
Re: Efficiently selecting a random, weighted element
by xdg (Monsignor) on Oct 10, 2006 at 18:11 UTC
    And blam-o, you're set up to choose your next file, it's as if file_d.txt never existed, and you can repeat ad infinitum until you've selected enough files out

    Depending on the number of files, what about just repeatedly picking additional files until you get something different from the first? Your "pick" algorithm is fast, so why splice out a file and recompute offsets each time?

    If you're picking a high percentage of the total files, then you'll be doing lots of useless picks of files already chosen, but if you're picking 2 of 300 files, it should work pretty well.

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
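
    For illustration only (this is not xdg's code): a minimal sketch of the repeat-until-new idea, assuming a hypothetical pick_one() that returns the index of one weighted-random pick as described in the root node:

    my $how_many = 2;
    my ( %seen, @picks );
    while ( @picks < $how_many ) {
        my $index = pick_one( \@indexes_of_files, $total_number_of_words );
        next if $seen{$index}++;    # already chosen: discard and pick again
        push @picks, $index;
    }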

      I like this approach, but I'll need to think it over. I may even get off my butt and write some code to actually benchmark it.

      My concern is that the chances of collisions vary not only with the number of files, but with how they're weighted. To use a contrived example, say we're picking 2 out of 300 files, but 1 of those files is weighted to contain 98% of the hitspace? You'll probably pick that the first time, and then re-pick it quite a bit until you actually successfully get something else.

      But for general use, this could definitely be an improvement. Maybe a hybrid approach - pick a file, and if it contains below a certain percentage of the hitspace, then just leave it alone and continue. If it is above a certain percentage, then splice it out (keeping in mind that you'd need to re-calculate previously saved indexes. If you keep index 4 flagged as one to skip over, and then you remove index 3, you need to change your flag to ignore index 3 instead of index 4, and so on).

        Like almost any algorithm, it all depends on the exact nature of the problem space.

        One refinement, if you really want to consider splicing out high-weight elements, is to create your array in sorted order so that all your highest weight files are at the end of the array. That will decrease the amount of recalculation necessary if you choose to drop them.

        In the extreme case, if the highest weight word is chosen, you just pop the last element of the array and decrease the word count and you're done. If the second highest weight word is chosen, you splice out that word and have only one index to recalculate. Etc.

        -xdg

        Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.
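
        For illustration (again, not xdg's code), a minimal sketch of building the offsets array in ascending weight order, so the heaviest files end up at the tail:

        my @by_weight = sort { $words_in_files{$a} <=> $words_in_files{$b} }
                        keys %words_in_files;

        my @indexes_of_files;
        my $offset = 0;
        for my $file (@by_weight) {
            push @indexes_of_files, [ $offset, $file ];
            $offset += $words_in_files{$file};
        }

        # If the heaviest file is picked, removal is just a pop and a
        # subtraction from the total; no other offsets change.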

      I think xdg's suggestion is reasonable.
Re: Efficiently selecting a random, weighted element
by Limbic~Region (Chancellor) on Oct 10, 2006 at 17:50 UTC
Re: Efficiently selecting a random, weighted element
by Skeeve (Parson) on Oct 16, 2006 at 13:08 UTC

    I'd not store the index where the file begins, but the number of words in it.

    Searching is then done by subtracting each file's number of words from the random number until it drops below zero.

    As a final step, the file found is swapped with the last unselected file and its number of words is subtracted from the total.

    #!/usr/bin/perl
    use strict;
    use warnings;

    my @number_of_words = (
        { number => 10, name => 'file_a.txt' },
        { number =>  1, name => 'file_b.txt' },
        { number =>  3, name => 'file_c.txt' },
        { number =>  5, name => 'file_d.txt' },
        { number => 10, name => 'file_e.txt' },
    );
    my $total_number_of_words = 29;

    my $select   = 3;
    my $selected = 0;

    while ( $select-- ) {
        my $randomindex = rand $total_number_of_words;
        my $last_index  = ( scalar @number_of_words ) - $selected - 1;
        my $i = $last_index + 1;
        do {
            $randomindex -= $number_of_words[--$i]->{'number'};
        } while ( $randomindex >= 0 );
        $total_number_of_words -= $number_of_words[$i]->{'number'};
        # print $number_of_words[$i]->{'name'}, "\n";
        @number_of_words[ $i, $last_index ] = @number_of_words[ $last_index, $i ];
        ++$selected;
    }

    for ( my $i = $#number_of_words - $selected; ++$i <= $#number_of_words; ) {
        print $number_of_words[$i]->{'name'}, "\n";
    }
    As you can see, your selected files are now in the last n positions of the array.


Re: Efficiently selecting a random, weighted element
by zentara (Archbishop) on Oct 10, 2006 at 16:17 UTC
    This is just a quick thought, but it would be fast. How about estimating an average value for the bytes per word in the files? Once you have that, just stat the files for filesize, then obtain a number $n = ( $filesize / $ave_bytes_per_word ).

    Then push the name of the file $n times into a selection array. When you're done filling it, just randomly select from the array.


    I'm not really a human, but I play one on earth. Cogito ergo sum a bum

      The average value (bytes/word) only optimizes out the word count step, which can be done fairly efficiently already (shell out to wc, if nothing else). Besides, for our purposes, averages weren't sufficient - it had to be exact.

      From that point on, this approach is simply the second one I had detailed with the "aaa...b...cc..." array and it suffers from the same scalability problems. You've got 100 files with 3,000 words each and a 300,000 element array. My method ends up with a 100 element array.

      You'll note that I didn't say "quickly", I said "efficiently". I was optimizing not only for execution time, but for memory storage. But yes, for smaller data sets, simply repeating the stored value n times is sufficient.
