in reply to Efficiently selecting a random, weighted element

This is just a quick thought, but it would be fast. How about estimating an average value for bytes per word in the files? Once you have that, just stat each file for its size, then compute a count $n = ( $filesize / $ave_bytes_per_word ).

Then push the name of the file into a selection array $n times. When the selection array is filled, just randomly select an element from it.
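A minimal sketch of that idea, using made-up file sizes and an assumed average of 6 bytes per word (the hash, filenames, and numbers here are all hypothetical):

```perl
use strict;
use warnings;

# Hypothetical file sizes in bytes (in practice, from stat or -s $file).
my %filesize = ( 'a.txt' => 600, 'b.txt' => 1200, 'c.txt' => 1800 );

# Assumed average bytes per word.
my $ave_bytes_per_word = 6;

my @selection;
for my $file ( sort keys %filesize ) {
    my $n = int( $filesize{$file} / $ave_bytes_per_word );
    push @selection, ($file) x $n;    # the file's name appears $n times
}

# Each file's chance of selection is proportional to its estimated word count.
my $pick = $selection[ rand @selection ];
```

Note that the array holds one entry per estimated word, so its size grows with the total word count across all files.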


I'm not really a human, but I play one on earth. Cogito ergo sum a bum

Re^2: Efficiently selecting a random, weighted element
by jimt (Chaplain) on Oct 10, 2006 at 16:40 UTC

    The average value (bytes/word) only optimizes away the word-count step, which can already be done fairly efficiently (shell out to wc, if nothing else). Besides, for our purposes, averages weren't sufficient; it had to be exact.

    From that point on, this approach is simply the second one I had detailed with the "aaa...b...cc..." array, and it suffers from the same scalability problems. With 100 files of 3,000 words each, you've got a 300,000-element array. My method ends up with a 100-element array.

    You'll note that I didn't say "quickly", I said "efficiently". I was optimizing not only for execution time, but for memory storage. But yes, for smaller data sets, simply repeating the stored value n times is sufficient.