Picking Random Lines from a File

de2425 has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Picking Random Lines from a File by bart (Canon) on Oct 09, 2008 at 18:45 UTC
I am getting output but I'm getting 300 lines that are exactly the same. `while (<IN>){ while ($count<=$size){ rand($.)<1 && ($line=$_); print OUT $line; $count++; } }`
Of course you are. For each line in the file, $_ has a specific value. After that, 300 times over, you decide, based on a random number, whether to assign this one value to $line. It is always the exact same string, and then you print it out. Assigning a value to a variable that already holds that value doesn't change a thing.
So, where's your thinko... It's quite obvious to me where you got the basis for the algorithm: it is (or used to be) in the official Perl FAQ. It goes something like this: `while (<IN>){ rand($.)<1 && ($line=$_); } print OUT $line;`
So you loop through the file, assign or don't assign the current value to $line based on a random value, and in the end, you print out what you have got. If you insist on doing this 300 times, you will have to read through the file 300 times.
If you don't want to do that, and you've got memory to spare, you can first read the contents of the file into an array, and randomly pick a line from that array. Using the same algorithm (for no good reason; it was chosen because it works without keeping everything in memory at once), this becomes: `my @lines = <IN>; for my $c (1 .. 300) { my $line; for my $i (0 .. $#lines) { rand($i+1)<1 and $line = $lines[$i]; } print OUT $line; }`
but it'll be a lot shorter to just write `my @lines = <IN>; for my $c (1 .. 300) { print OUT $lines[int rand @lines]; }`
That leaves in duplicates. If you don't want duplicates, simply import the `shuffle` function from List::Util, shuffle the lines array, and print out the first 300. `use List::Util qw(shuffle); my @lines = shuffle(<IN>); print OUT @lines[0 .. 299];`
This assumes there are at least 300 lines in the file, or you'll get a bunch of undefined values at the end.
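As a hedged sketch of the shuffle approach above, the following guards against files with fewer lines than requested, so no undefined values are printed. The helper name `pick_distinct_lines` and the in-memory demo file are assumptions for illustration, not part of the original post.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Return up to $n distinct lines chosen at random from an open filehandle.
# pick_distinct_lines is a hypothetical helper name, not from the thread.
sub pick_distinct_lines {
    my ($fh, $n) = @_;
    my @lines = shuffle(<$fh>);                    # slurp every line, then shuffle
    my $last  = $n < @lines ? $n - 1 : $#lines;    # clamp to the lines that exist
    return @lines[0 .. $last];
}

# Demo on an in-memory "file" of 10 lines (an assumption for illustration).
my $data = join '', map { "line $_\n" } 1 .. 10;
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
my @picked = pick_distinct_lines($in, 3);
close $in;
print @picked;
```

Like the one-liner it is based on, this reads the whole file into memory, which should be fine for a file of about 90000 lines.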
Re: Picking Random Lines from a File by ikegami (Patriarch) on Oct 09, 2008 at 17:33 UTC
When $. == 1, `while ($count<=$size){ rand($.)<1 && ($line=$_); print OUT $line; $count++; }` is the same as `while ($count<=$size){ $line=$_; print OUT $line; $count++; }` because `rand(1)` always returns something less than `1`. Therefore, you always print the first line $size times. And when it comes time for $. == 2, `$count` is greater than `$size` from the first pass, so the loop is never entered. There's nothing you want to happen to a line more than once, so there shouldn't be any nested loops. The inner `while` is probably supposed to be an `if`.
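To illustrate the single-loop structure, here is a small sketch of the one-line selection idiom with no nested loop, plus a tally over repeated runs. Line k replaces the current pick with probability 1/k, so each line should come out about equally often. The helper name `pick_one_line`, the explicit counter `$k` (used in place of `$.`), and the 5-line in-memory file are assumptions for illustration.

```perl
use strict;
use warnings;

# Single pass, no nested loop: line number $k replaces the current pick
# with probability 1/$k, so every line ends up equally likely.
# pick_one_line is a hypothetical helper name, not from the thread.
sub pick_one_line {
    my ($fh) = @_;
    my ($pick, $k) = (undef, 0);
    while (my $line = <$fh>) {
        $k++;
        $pick = $line if rand($k) < 1;    # always true for the first line
    }
    return $pick;                         # undef if the file was empty
}

# Tally picks over many runs of a 5-line in-memory "file".
my $data = join '', map { "line $_\n" } 1 .. 5;
my %count;
for (1 .. 10_000) {
    open my $in, '<', \$data or die "Cannot open in-memory file: $!";
    $count{ pick_one_line($in) }++;
    close $in;
}
print "$count{$_} x $_" for sort keys %count;
```

Each of the five counts should land near 2000, though the exact numbers vary from run to run.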
Re: Picking Random Lines from a File by JavaFan (Canon) on Oct 09, 2008 at 20:35 UTC
The classical algorithm (I think it's described by Knuth) to pick N lines from a file with M lines (N <= M) goes like this: Read the first N lines into a buffer. For each subsequent line (say, line k), decide with chance N/k whether to accept or reject this line. If accepted, it replaces a randomly chosen line in the buffer. In Perl code, you get something like: `my @buffer; push @buffer, scalar <IN> for 1 .. $N; while (my $line = <IN>) { next unless rand($.) < $N; $buffer[rand @buffer] = $line; } print @buffer;`
A few points: It assumes you have enough memory to store N lines. If you have enough memory to slurp in the entire file, it may be easier to read the file into an array, shuffle the array, and print the first $N entries. The code as is doesn't preserve order - but you can always store the line number with the line itself, and sort afterwards.
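The order-preserving variant mentioned in the last point could be sketched as follows: each buffer slot stores the line number next to the text, and the buffer is sorted by line number at the end. The helper name `reservoir_sample`, the explicit counter `$k`, and the in-memory demo file are assumptions for illustration.

```perl
use strict;
use warnings;

# Reservoir sampling: keep $n lines chosen uniformly from a stream of
# unknown length, remembering line numbers so file order can be restored.
# reservoir_sample is a hypothetical helper name, not from the thread.
sub reservoir_sample {
    my ($fh, $n) = @_;
    my @buffer;                                        # entries are [line_number, text]
    my $k = 0;                                         # number of lines seen so far
    while (my $line = <$fh>) {
        $k++;
        if (@buffer < $n) {
            push @buffer, [$k, $line];                 # fill the buffer first
        }
        elsif (rand($k) < $n) {                        # accept with chance n/k
            $buffer[int rand @buffer] = [$k, $line];   # overwrite a random slot
        }
    }
    # Restore original file order before returning just the text.
    return map { $_->[1] } sort { $a->[0] <=> $b->[0] } @buffer;
}

# Demo on an in-memory "file" of 20 lines (an assumption for illustration).
my $data = join '', map { "line $_\n" } 1 .. 20;
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
print reservoir_sample($in, 5);
close $in;
```

If the file has fewer than `$n` lines, this version simply returns all of them in order instead of padding the buffer with undefined entries.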
Re^2: Picking Random Lines from a File by nathanroy (Initiate) on Apr 21, 2009 at 20:20 UTC
Hi, I am fairly new to Perl, and I want to randomly select chunks of 4 lines from a large file. I was looking at this thread, and I am able to select N lines randomly, but I want to go a little further: select a random line number (an odd number) and the three lines following it, then select another random line (an odd number) and the three lines following it, and so on until I have N. Any help would be greatly appreciated. Thanks
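One possible way to approach the chunked selection described above, as a rough sketch only: collect the odd line numbers that still have three lines after them, shuffle them, and take a 4-line slice at each of the first N. It assumes the file fits in memory, that start positions must be distinct, and that chunks are allowed to overlap (chunks starting at lines 1 and 3 share two lines). The helper name `random_chunks` is hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Pick up to $n chunks of 4 consecutive lines, each starting on an odd
# (1-based) line number. random_chunks is a hypothetical helper name;
# chunks may overlap one another.
sub random_chunks {
    my ($lines, $n) = @_;                        # $lines is an array reference
    my $last_start = @$lines - 3;                # last start that still has 4 lines
    my @starts = shuffle(grep { $_ % 2 } 1 .. $last_start);   # odd starts only
    $#starts = $n - 1 if @starts > $n;           # keep at most $n start positions
    # Convert each 1-based start into a 4-line slice of the array.
    return map { [ @{$lines}[ $_ - 1 .. $_ + 2 ] ] } @starts;
}

# Demo on 20 in-memory lines (an assumption for illustration).
my @lines = map { "line $_\n" } 1 .. 20;
for my $chunk (random_chunks(\@lines, 3)) {
    print @$chunk, "--\n";
}
```

If the chunks must not overlap, restricting the candidate starts to lines 1, 5, 9, and so on would be one way to adapt it.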
Re: Picking Random Lines from a File by toolic (Bishop) on Oct 09, 2008 at 17:45 UTC
You could try Randomly select N lines from a file, on the fly
Re: Picking Random Lines from a File by Illuminatus (Curate) on Oct 09, 2008 at 17:51 UTC
Take a look at File::Random. random_line() looks like what you are looking for.
Re: Picking Random Lines from a File by Illuminatus (Curate) on Oct 09, 2008 at 17:22 UTC
The call to rand is actually fine.
Re^2: Picking Random Lines from a File by ikegami (Patriarch) on Oct 09, 2008 at 17:27 UTC
If so, his `perl` is broken. Quoting the `rand` documentation: "Automatically calls `srand` unless `srand` has already been called."
`>perl -e"print rand 1"`
`0.83843994140625`
`>perl -e"print rand 1"`
`0.795257568359375`
`>perl -e"print rand 1"`
`0.84637451171875`
Re^2: Picking Random Lines from a File by de2425 (Sexton) on Oct 09, 2008 at 17:38 UTC
Is there another way I can put the rand argument in there? I have a text file with about 90000 lines and I just need to pick 300 lines randomly from it. Sorry for all the questions. I'm truly still a novice at this.