Re: Picking Random Lines from a File

The classical algorithm (I think it's described by Knuth) to pick N lines from a file with M lines (N <= M) goes like this:

Read the first N lines into a buffer.
For each subsequent line (say, line k, with k > N), accept it with probability N/k and reject it otherwise. If accepted, it replaces a randomly chosen line in the buffer.

In Perl code, you get something like:

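my @buffer;
# Fill the buffer with the first $N lines.
push @buffer, scalar <IN> for 1 .. $N;
while (my $line = <IN>) {
    # $. is the current line number k; accept with probability $N/k.
    next unless rand($.) < $N;
    # Overwrite a randomly chosen slot with the accepted line.
    $buffer[rand @buffer] = $line;
}
print @buffer;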

A few points:

It assumes you have enough memory to store N lines.
If you have enough memory to slurp in the entire file, it may be easier to read the file into an array, shuffle the array, and print the first $N entries.
The code as is doesn't preserve order - but you can always store the line number with the line itself, and sort afterwards.
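
The last two points can be sketched roughly as follows. This is only an illustration, not tested against large files: the sub names sample_in_order and sample_by_shuffle are made up for this example, and the in-memory filehandle at the top just stands in for a real file. The first sub stores each line's number next to the line and sorts on it at the end; the second assumes the whole file fits in memory and uses shuffle from List::Util.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Demo input: an in-memory "file" of 100 lines (stand-in for a real file).
my $text = join '', map { "line $_\n" } 1 .. 100;
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print sample_in_order($fh, 5);

# Reservoir sampling that keeps the line number with each line,
# so the sample can be returned in file order.
# (sample_in_order is a hypothetical name, not from the original post.)
sub sample_in_order {
    my ($in, $n) = @_;
    my @buffer;        # each entry is [line_number, line]
    my $seen = 0;      # own line counter, so we don't rely on $.
    while (my $line = <$in>) {
        $seen++;
        if (@buffer < $n) {
            # The first $n lines simply fill the buffer.
            push @buffer, [$seen, $line];
        }
        elsif (rand($seen) < $n) {
            # Accepted with probability $n/$seen: replace a random slot.
            $buffer[rand @buffer] = [$seen, $line];
        }
    }
    # Sort on the stored line number, then strip it off again.
    return map { $_->[1] } sort { $a->[0] <=> $b->[0] } @buffer;
}

# Slurp-and-shuffle alternative for files that fit in memory.
# (sample_by_shuffle is likewise a hypothetical name.)
sub sample_by_shuffle {
    my ($in, $n) = @_;
    my @lines = <$in>;
    @lines = shuffle @lines;
    $n = @lines if $n > @lines;    # don't ask for more lines than exist
    return @lines[0 .. $n - 1];
}
```

The buffer now holds N small array references instead of N plain strings, so the memory cost stays proportional to N; only the final sort is extra work.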


Replies are listed 'Best First'.
Re^2: Picking Random Lines from a File by nathanroy (Initiate) on Apr 21, 2009 at 20:20 UTC
Hi, I am fairly new to Perl, and I want to randomly select chunks of 4 lines from a large file. I was looking at this thread, and I am able to select N lines at random, but I want to go a little further: select a random line number (an odd number) and then take the three lines following it, then select another random line number (again odd) and take the three lines following it, and so on, up to N. Any help would be greatly appreciated. Thanks
