The classical algorithm (I think it's described by Knuth) to pick N lines from a file with M lines (N <= M) goes like this:
- Read the first N lines in a buffer.
- For each next line (say, line k), decide with chance N/k, whether to accept or reject this line. If accepted, randomly replace one of the lines in the buffer.
In Perl code, you get something like:
my @buffer;
push @buffer, scalar <IN> for 1 .. $N;
while (my $line = <IN>) {
next unless rand($.) < $N;
$buffer [rand @buffer] = $_;
}
print @buffer;
A few points:
- It assumes you have enough memory to store N lines.
- If you have enough memory to slurp in the entire file, it may be easier to read in the file in an array, shuffle the array, and print the first $N entries.
- The code as is doesn't preserve order - but you can always store the line number with the line itself, and sort afterwards.