It's definitely a tradeoff of memory vs. CPU. For the actual work problem that inspired this, I used an offset index that could be reused under mod_perl. Still, this solution might be useful if your file has so many lines that even the index is too large, or if you're simply looking for Another Way To Do It.
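For comparison, here's a rough sketch of that offset-index approach (not the actual code I used, just an illustration): make one pass to record where each line starts, then seek straight to randomly chosen lines on later requests.

use strict;
use warnings;
use List::Util qw(shuffle);

my $file = shift @ARGV or die "usage: $0 file [sample_size]\n";
my $size = shift @ARGV || 3;

# pass 1: record the byte offset where each line starts
open my $fh, '<', $file or die "can't open '$file': $!";
my @offset = (0);
push @offset, tell $fh while <$fh>;
pop @offset;    # final entry is EOF, not the start of a line

die "file has fewer than $size lines\n" if @offset < $size;

# later (e.g. per request under mod_perl): seek straight to $size distinct lines
for my $i ( ( shuffle 0 .. $#offset )[ 0 .. $size - 1 ] ) {
    seek $fh, $offset[$i], 0 or die "seek failed: $!";
    print scalar <$fh>;
}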
Basically, it's the same idea as the Cookbook one. For each line you pass, the probability that you pick it up and assign it to a slot is the sample size divided by the number of lines seen so far. If you do pick it up, you need to decide which element you currently hold to throw out, so you pick a random offset in the range of the sample size.
I think this works out mathematically (check, anyone?), and it only needs special-case handling for the first N lines, all of which you have to keep. I thought of just reading in N lines to begin with, but that loses some randomness in the ordering, which would have to be restored with a shuffle at the end. I decided to go with some extra handling in the algorithm instead, but it adds overhead, so I'm still considering the alternative.
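For anyone who wants to check the math, here's the usual reservoir-sampling argument as I understand it. Suppose that after line k (k >= N) every line seen so far sits in the sample with probability N/k. Line k+1 gets picked with probability N/(k+1), and when it is picked it evicts any given slot with probability 1/N, so a line already in the sample survives the step with probability

    1 - (N/(k+1)) * (1/N) = k/(k+1)

which makes its overall probability (N/k) * (k/(k+1)) = N/(k+1), the same as the new line's. By induction, every line ends up in the final sample with probability N divided by the total number of lines.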
use strict;

my $size  = shift @ARGV || 3;    # sample size
my @sample;
my $taken = 0;                   # for making sure we get the first $size lines

while ( my $line = <> ) {
    chomp $line;

    # keep this line with probability $size / lines-seen-so-far
    # (always true for the first $size lines, since $size/$. >= 1)
    if ( rand(1) < ( $size / $. ) ) {
        my $position;
        do {
            $position = int rand($size);
        } while ( $taken < $size && defined $sample[$position] );
        $sample[$position] = $line;
        $taken++;
    }
}

for ( my $i = 0; $i < @sample; $i++ ) {
    print "[$i] '$sample[$i]'\n";
}
print "\n";
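And here's a rough sketch of the alternative I mentioned, just as an illustration: prime the sample with the first $size lines (which removes the $taken bookkeeping), then shuffle once at the end to restore the randomness of the ordering.

use strict;
use List::Util qw(shuffle);

my $size = shift @ARGV || 3;    # sample size
my @sample;

while ( my $line = <> ) {
    chomp $line;
    if ( @sample < $size ) {
        # just keep the first $size lines as-is
        push @sample, $line;
    }
    elsif ( rand(1) < ( $size / $. ) ) {
        # same pick-and-replace step as above
        $sample[ int rand $size ] = $line;
    }
}

# the first $size lines went in file order, so shuffle once at the end
@sample = shuffle @sample;

for my $i ( 0 .. $#sample ) {
    print "[$i] '$sample[$i]'\n";
}
print "\n";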