The cookbook has a nice example of how to pull one line out of a file without loading all the lines into memory or creating an offset index. My challenge, I decided, was to do the same but with N lines, without slurping the file or seeking through an offset index. Here's my solution...

It's definitely a tradeoff of memory vs. CPU. For the actual work problem that inspired this, I used an offset index so it could be reused under mod_perl. Still, this solution might be useful if your file has so many lines that even the index is too large, or if you're simply looking for Another Way To Do It.

Basically, it's the same as the cookbook one. For each line you pass, the probability that you pick it up and assign it to a slot is the sample size divided by the number of lines seen so far. If you do pick it up, you need to decide which element you currently hold to throw out, so you pick a random offset based on the sample size.

I think this works out mathematically (check, anyone?), and only needs special-case handling for the first N lines, all of which you need to keep. I thought of just reading in N lines to begin with, but that loses some randomness of ordering, which would have to be handled with a shuffle at the end. I actually decided to go with some extra handling in the algorithm instead, but it does add some overhead, so I'm still considering the alternative.

use strict;

my $size  = shift @ARGV || 3;   # sample size
my @sample;
my $taken = 0;                  # for making sure we get the first $size lines

while ( my $line = <> ) {
    chomp $line;
    if ( rand(1) < ($size/$.) ) {
        my $position;
        do {
            $position = int rand($size);
        } while ( $taken < $size && $sample[$position] );
        $sample[$position] = $line;
        $taken++;
    }
}

for ( my $i = 0; $i < @sample; $i++ ) {
    print "[$i] '$sample[$i]'\n";
}
print "\n";
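For what it's worth, the probability argument is the standard reservoir-sampling one: each new line enters the sample with probability N/i and evicts a uniformly chosen current member, so after the last line every line has an equal chance of being kept. A quick empirical sanity check is sketched below; the sub wrapper, the variable names, and the defined test for empty slots are additions for the test harness, not part of the code above.

use strict;
use warnings;

# Same sampling loop as above, wrapped in a sub so it can be run repeatedly
# over a fixed list of pretend "lines".  The only tweak is using "defined"
# for the empty-slot test so falsy lines ('' or '0') don't look like empty
# slots.
sub sample_n {
    my ($size, @lines) = @_;
    my (@sample, $taken, $seen);
    $taken = $seen = 0;
    for my $line (@lines) {
        $seen++;
        if ( rand(1) < ($size/$seen) ) {
            my $position;
            do {
                $position = int rand($size);
            } while ( $taken < $size && defined $sample[$position] );
            $sample[$position] = $line;
            $taken++;
        }
    }
    return grep { defined } @sample;
}

# Count how often each of ten lines lands in a 3-line sample.  If the math
# holds, every line should show up in roughly 30% of the 10,000 trials.
my @lines = ('a' .. 'j');
my %hits;
for (1 .. 10_000) {
    $hits{$_}++ for sample_n(3, @lines);
}
printf "%s: %5d\n", $_, $hits{$_} for sort keys %hits;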

Re: selecting N random lines from a file in one pass
by BrowserUk (Patriarch) on Dec 22, 2004 at 18:29 UTC

      Re: Approach to efficiently choose a random line from a large file is a great reference! (++BrowserUk) Late last night I thought about the problem again and came up with something independently that was almost verbatim the referenced solution. If you're willing to live with something not truly random (biased slightly by line length), then just seeking to a random position in the file and pulling out a line, N times over, is far, far faster than reading all the lines. The bias is worst when your lines vary a lot in length. But if the result only needs to look like random lines, then this is hands down the winner:

                     s/iter  original  late_chomp  rand_seek
      original         4.83        --         -9%      -100%
      late_chomp       4.39       10%          --      -100%
      rand_seek    1.20e-03   402667%     365992%         --

      This is based on pulling 10 lines from the 123 MB cpan-gets file (from Help update the Phalanx 100). My code is similar to Re: Approach to efficiently choose a random line from a large file, though rather than searching backwards from a random position, I read and discard the line fragment I jumped into and then take the next line. (Which does mean the first line can't be picked, but just insert a blank line at the beginning.)

      # assumes a global "$n" and "$file"
      sub rand_seek {
          my $size = $n || 3; # sample size
          my (@sample, @seen, $line, $pos_after);
          my $filesize = -s $file;
          open(my $fh, $file) or die;
          while (@sample < $size) {
              seek($fh, int(rand($filesize)), 0);
              <$fh>;              # skip this fragment of a line
              $line = <$fh>;      # get the next line
              $pos_after = tell($fh);
              next if grep { $pos_after == $_ } @seen;
              chomp $line;
              push @sample, $line if $line;
              push @seen, $pos_after;
          }
          close $fh;
      }

      -xdg


        (Which does mean the first line can't be picked, but just insert a blank line at the beginning.)
        And it also means that if your random seek takes you into the last line of the file, your read returns nothing, because you've hit the end of the file. So, in that case, you could kill both birds by taking the first line of the file, which happens to be very quick to find. ;-)
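        One way to fold that suggestion into rand_seek above, sketched with the same assumed globals $n and $file and untested against the original benchmark: when the discarded fragment runs into EOF, wrap around to the start of the file, which also makes line 1 reachable.

        sub rand_seek_wrap {
            my $size = $n || 3;             # sample size
            my (@sample, @seen);
            my $filesize = -s $file;
            open(my $fh, '<', $file) or die "can't open $file: $!";
            while (@sample < $size) {
                seek($fh, int(rand($filesize)), 0);
                <$fh>;                      # discard the fragment we landed in
                my $line = <$fh>;           # take the next full line
                unless (defined $line) {    # fell off the end of the file...
                    seek($fh, 0, 0);        # ...so wrap around
                    $line = <$fh>;          # and take line 1 instead
                }
                my $pos_after = tell($fh);
                next if grep { $pos_after == $_ } @seen;
                chomp $line;
                push @sample, $line;
                push @seen, $pos_after;
            }
            close $fh;
            return @sample;
        }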
Re: selecting N random lines from a file in one pass
by tilly (Archbishop) on Dec 23, 2004 at 02:02 UTC
    If you really want efficiency, have multiple loops rather than a single loop with an if test. Like this:
    use strict;
    my $size = shift @ARGV || 3; # sample size
    my @sample;

    # read n lines
    for (1..$size) {
        push @sample, scalar <>;
    }

    # shuffle - I'll do it in pure perl.
    my $i = $size;
    while ($i--) {
        my $j = int rand($i + 1);
        @sample[$i, $j] = @sample[$j, $i];
    }

    # now read the rest of the lines in.
    while (<>) {
        my $choice = int rand($.);
        if ($choice < $size) {
            $sample[$choice] = $_;
        }
    }

    # and minimize work by chomping as few lines as possible.
    chomp(@sample);
    This can be beaten. But not by much...

    Update: Per the note by the Anonymous Monk below, fixed the off-by-one error.

      # shuffle - I'll do it in pure perl.
      Too bad your implementation is flawed:
      use strict;
      use warnings;

      my $size  = 4;
      my @array = 1 .. $size;
      my %count;

      foreach (1 .. 24_000) {
          my @sample = @array;
          my $i = $size;
          while ($i--) {
              my $j = int rand($i);
              @sample[$i, $j] = @sample[$j, $i];
          }
          $count {"@sample"} ++;
      }

      foreach my $key (sort keys %count) {
          printf "%s: %4d\n", $key, $count {$key};
      }

      __END__
      2 3 4 1: 3999
      2 4 1 3: 4029
      3 1 4 2: 3969
      3 4 2 1: 4000
      4 1 2 3: 4031
      4 3 1 2: 3972
      Instead of getting 24 different permutations, each appearing about 1,000 times, your algorithm produces only 6 permutations, each appearing about 4,000 times.

      Better use a module next time!
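      For example, List::Util (on CPAN, and bundled with modern perls) provides a shuffle that returns the list in a random order, so you never have to get the index arithmetic right yourself. A minimal sketch:

      use strict;
      use warnings;
      use List::Util qw(shuffle);

      my @sample   = ('a' .. 'd');
      my @shuffled = shuffle(@sample);   # unbiased shuffle, no manual index juggling
      print "@shuffled\n";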

Re: selecting N random lines from a file in one pass
by Anonymous Monk on Dec 22, 2004 at 17:34 UTC

    Off-topic, but I can't resist. Every time I see a C-style for loop in someone's Perl code, I die a little inside. How about this instead:

    for my $i ( 0 .. $#sample ) {
        print "[$i] '$sample[$i]'\n";
    }
      Old habits... =)
      Anyone know if that's actually faster or slower in execution? What about memory consumption?
        Don't worry about it. If there are any differences, they are so small you would never see them outside of a crazy benchmark.
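        If you want to see for yourself, the core Benchmark module makes the comparison easy. A sketch (the c_style/range labels and the 10,000-element test array are made up for this example; absolute numbers will vary by perl version and hardware):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my @sample = ('x') x 10_000;

        # Build a string instead of printing so I/O doesn't swamp the timing.
        cmpthese(-2, {
            c_style => sub {
                my $out = '';
                for ( my $i = 0; $i < @sample; $i++ ) { $out .= "[$i] '$sample[$i]'\n" }
            },
            range => sub {
                my $out = '';
                for my $i ( 0 .. $#sample ) { $out .= "[$i] '$sample[$i]'\n" }
            },
        });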
      Every time I see someone use $#array just to be able to use foreach to loop over the indices and avoid using a C-style loop, I die a little.
Re: selecting N random lines from a file in one pass
by itub (Priest) on Dec 22, 2004 at 18:28 UTC
    Thanks, you must have read my mind. I was just thinking about this same problem yesterday!

    There is one potential problem, depending on what one wants. If the file has fewer lines than the sample size, you can get a sample with "holes":

    $ ./select.pl 3
    1
    2
    [0] '2'
    [1] ''
    [2] '1'
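    If that matters for your use, one small guard is to print only the slots that were actually filled. A minimal, self-contained sketch (the hard-coded @sample just imitates what a 3-slot sample of a 2-line file might hold):

    use strict;
    use warnings;

    my @sample = ('2', undef, '1');          # a 3-slot sample of a 2-line file
    my @filled = grep { defined } @sample;   # drop the unfilled slots
    print "[$_] '$filled[$_]'\n" for 0 .. $#filled;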
Re: selecting N random lines from a file in one pass
by xdg (Monsignor) on Dec 22, 2004 at 23:33 UTC

    If the focus is efficiency, why are you chomping every line when you only want a few? Moving the chomp inside the if will speed up your code by about 10% - 12% (based on a quick benchmark run I did).
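    Concretely, the change looks like this (a sketch of the idea, not necessarily the exact code that was benchmarked; the declarations are copied from the original post):

    use strict;
    my $size  = shift @ARGV || 3;   # sample size
    my @sample;
    my $taken = 0;

    while ( my $line = <> ) {
        if ( rand(1) < ($size/$.) ) {
            chomp $line;            # chomp only the lines that enter the sample
            my $position;
            do {
                $position = int rand($size);
            } while ( $taken < $size && $sample[$position] );
            $sample[$position] = $line;
            $taken++;
        }
    }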

    -xdg


Re: selecting N random lines from a file in one pass
by boo_radley (Parson) on Dec 22, 2004 at 17:02 UTC
      Your self-promotion requires reading the entire file into memory before it can be passed to the module, which makes it kind of inapplicable to the problem.

      However, if you do have the entire data set in memory, there's a simpler approach than the math theory you use in your module. Rather than iterating over the array, consider just drawing elements at random; if you draw a duplicate, redraw. The larger the set, the lower the odds of a duplicate. This scales a lot better than your module, though I admit it does more redrawing when the set size is close to the sample size.

      It's a lot like how, in gaming, you can fake a d12 by rolling a d20 until you've got a number in range.

      use strict;
      use Data::Dumper;

      my @set  = ('a'..'z');
      my $size = 3;
      my @sample;
      my %seen;

      while( @sample < $size ){
          my $elt;
          do{
              $elt = int rand(@set);
          }while( $seen{$elt} );
          push @sample, $set[$elt];
          $seen{$elt}++;
      }

      print Dumper \@sample;

      This example should really have a check, too, that the sample size doesn't exceed the size of the set, to avoid looping forever.
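      That guard could be as simple as the line below, dropped in right after @set and $size are declared (a sketch, assuming the variable names from the snippet above):

      # refuse to sample more elements than the set contains
      die sprintf("sample size (%d) is larger than the set (%d)\n", $size, scalar @set)
          if $size > @set;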