The cookbook has a nice example of how to pull one line out of a file without loading all the lines into memory or creating an offset index. My challenge, I decided, was to do the same but with N lines, without slurping the file or seeking through an offset index. Here's my solution...

It's definitely a tradeoff of memory vs. CPU. For the actual work problem that inspired this, I used an offset index so it could be reused under mod_perl. Still, this solution might be useful if your file has so many lines that even the index is too large, or if you're simply looking for Another Way To Do It.

Basically, it's the same as the cookbook one. For each line you pass, the probability that you pick it up and assign it to a slot is the sample size divided by the number of lines seen so far. If you do pick it up, you need to decide which element you currently hold to throw out, so you pick a random offset based on the sample size.

I think this works out mathematically (check, anyone?), and only needs special-case handling for the first N lines, all of which you need to keep. I thought of just reading in N lines to begin with, but that loses some randomness of ordering, which would have to be handled with a shuffle at the end. I actually decided to go with some extra handling in the algorithm instead, but it does add some overhead, so I'm still considering the alternative.

use strict;

my $size  = shift @ARGV || 3;   # sample size
my @sample;
my $taken = 0;                  # for making sure we get the first $size lines

while ( my $line = <> ) {
    chomp $line;
    if ( rand(1) < ($size/$.) ) {
        my $position;
        do {
            $position = int rand($size);
        } while ( $taken < $size && $sample[$position] );
        $sample[$position] = $line;
        $taken++;
    }
}

for ( my $i = 0; $i < @sample; $i++ ) {
    print "[$i] '$sample[$i]'\n";
}
print "\n";
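For what it's worth, the probability argument is the standard reservoir-sampling one: each new line enters the sample with probability N/i and evicts a uniformly chosen current member, so after the last line every line has an equal chance of being kept. A quick empirical sanity check is sketched below; the sub wrapper, the variable names, and the defined test for empty slots are additions for the test harness, not part of the code above.

use strict;
use warnings;

# Same sampling loop as above, wrapped in a sub so it can be run repeatedly
# over a fixed list of pretend "lines".  The only tweak is using "defined"
# for the empty-slot test so falsy lines ('' or '0') don't look like empty
# slots.
sub sample_n {
    my ($size, @lines) = @_;
    my (@sample, $taken, $seen);
    $taken = $seen = 0;
    for my $line (@lines) {
        $seen++;
        if ( rand(1) < ($size/$seen) ) {
            my $position;
            do {
                $position = int rand($size);
            } while ( $taken < $size && defined $sample[$position] );
            $sample[$position] = $line;
            $taken++;
        }
    }
    return grep { defined } @sample;
}

# Count how often each of ten lines lands in a 3-line sample.  If the math
# holds, every line should show up in roughly 30% of the 10,000 trials.
my @lines = ('a' .. 'j');
my %hits;
for (1 .. 10_000) {
    $hits{$_}++ for sample_n(3, @lines);
}
printf "%s: %5d\n", $_, $hits{$_} for sort keys %hits;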

Re: selecting N random lines from a file in one pass
by BrowserUk (Patriarch) on Dec 22, 2004 at 18:29 UTC

      Re: Approach to efficiently choose a random line from a large file is a great reference! (++BrowserUk) Late last night I thought about the problem again and came up with something independently that was almost verbatim the referenced solution. If you're willing to live with something not truly random (biased slightly by line length), then just seeking to a random position in the file and pulling out a line, N times over, is far, far faster than reading all the lines. The bias is worst when your lines vary a lot in length. But if the result only needs to look like random lines, then this is hands down the winner:

                     s/iter  original  late_chomp  rand_seek
      original         4.83        --         -9%      -100%
      late_chomp       4.39       10%          --      -100%
      rand_seek    1.20e-03   402667%     365992%         --

      This is based on pulling 10 lines from the 123 MB cpan-gets file (from Help update the Phalanx 100). My code is similar to Re: Approach to efficiently choose a random line from a large file, though rather than searching backwards from a random position, I read and discard the line fragment I jumped into and then take the next line. (Which does mean the first line can't be picked, but just insert a blank line at the beginning.)

      # assumes a global "$n" and "$file"
      sub rand_seek {
          my $size = $n || 3; # sample size
          my (@sample, @seen, $line, $pos_after);
          my $filesize = -s $file;
          open(my $fh, $file) or die;
          while (@sample < $size) {
              seek($fh, int(rand($filesize)), 0);
              <$fh>;              # skip this fragment of a line
              $line = <$fh>;      # get the next line
              $pos_after = tell($fh);
              next if grep { $pos_after == $_ } @seen;
              chomp $line;
              push @sample, $line if $line;
              push @seen, $pos_after;
          }
          close $fh;
      }

      -xdg


        (Which does mean the first line can't be picked, but just insert a blank line at the beginning.)
        And it also means that if your random seek takes you into the last line of the file, your read returns nothing, because you've hit the end of the file. So, in that case, you could kill both birds by taking the first line of the file, which happens to be very quick to find. ;-)
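        One way to fold that suggestion into rand_seek above, sketched with the same assumed globals $n and $file and untested against the original benchmark: when the discarded fragment runs into EOF, wrap around to the start of the file, which also makes line 1 reachable.

        sub rand_seek_wrap {
            my $size = $n || 3;             # sample size
            my (@sample, @seen);
            my $filesize = -s $file;
            open(my $fh, '<', $file) or die "can't open $file: $!";
            while (@sample < $size) {
                seek($fh, int(rand($filesize)), 0);
                <$fh>;                      # discard the fragment we landed in
                my $line = <$fh>;           # take the next full line
                unless (defined $line) {    # fell off the end of the file...
                    seek($fh, 0, 0);        # ...so wrap around
                    $line = <$fh>;          # and take line 1 instead
                }
                my $pos_after = tell($fh);
                next if grep { $pos_after == $_ } @seen;
                chomp $line;
                push @sample, $line;
                push @seen, $pos_after;
            }
            close $fh;
            return @sample;
        }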
Re: selecting N random lines from a file in one pass
by tilly (Archbishop) on Dec 23, 2004 at 02:02 UTC
    If you really want efficiency, have multiple loops rather than a single loop with an if test. Like this:
    use strict;
    my $size = shift @ARGV || 3; # sample size
    my @sample;

    # read n lines
    for (1..$size) {
        push @sample, scalar <>;
    }

    # shuffle - I'll do it in pure perl.
    my $i = $size;
    while ($i--) {
        my $j = int rand($i + 1);
        @sample[$i, $j] = @sample[$j, $i];
    }

    # now read the rest of the lines in.
    while (<>) {
        my $choice = int rand($.);
        if ($choice < $size) {
            $sample[$choice] = $_;
        }
    }

    # and minimize work by chomping as few lines as possible.
    chomp(@sample);
    This can be beaten. But not by much...

    Update: Per the note by the Anonymous Monk below, fixed the off-by-one error.

      # shuffle - I'll do it in pure perl.
      Too bad your implementation is flawed:
      use strict;
      use warnings;

      my $size  = 4;
      my @array = 1 .. $size;
      my %count;

      foreach (1 .. 24_000) {
          my @sample = @array;
          my $i = $size;
          while ($i--) {
              my $j = int rand($i);
              @sample[$i, $j] = @sample[$j, $i];
          }
          $count {"@sample"} ++;
      }

      foreach my $key (sort keys %count) {
          printf "%s: %4d\n", $key, $count {$key};
      }

      __END__
      2 3 4 1: 3999
      2 4 1 3: 4029
      3 1 4 2: 3969
      3 4 2 1: 4000
      4 1 2 3: 4031
      4 3 1 2: 3972
      Instead of getting 24 different permutations, each appearing about 1,000 times, your algorithm produces only 6 permutations, each appearing about 4,000 times.

      Better use a module next time!
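      For example, List::Util (on CPAN, and bundled with modern perls) provides a shuffle that returns the list in a random order, so you never have to get the index arithmetic right yourself. A minimal sketch:

      use strict;
      use warnings;
      use List::Util qw(shuffle);

      my @sample   = ('a' .. 'd');
      my @shuffled = shuffle(@sample);   # unbiased shuffle, no manual index juggling
      print "@shuffled\n";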

Re: selecting N random lines from a file in one pass
by Anonymous Monk on Dec 22, 2004 at 17:34 UTC

    Off-topic, but I can't resist. Every time I see a C-style for loop in someone's Perl code, I die a little inside. How about this instead:

    for my $i ( 0 .. $#sample ) {
        print "[$i] '$sample[$i]'\n";
    }
      Old habits... =)
      Anyone know if that's actually faster or slower in execution? What about memory consumption?
        Don't worry about it. If there are any differences, they are so small you would never see them outside of a crazy benchmark.
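        If you want to see for yourself, the core Benchmark module makes the comparison easy. A sketch (the c_style/range labels and the 10,000-element test array are made up for this example; absolute numbers will vary by perl version and hardware):

        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my @sample = ('x') x 10_000;

        # Build a string instead of printing so I/O doesn't swamp the timing.
        cmpthese(-2, {
            c_style => sub {
                my $out = '';
                for ( my $i = 0; $i < @sample; $i++ ) { $out .= "[$i] '$sample[$i]'\n" }
            },
            range => sub {
                my $out = '';
                for my $i ( 0 .. $#sample ) { $out .= "[$i] '$sample[$i]'\n" }
            },
        });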
      Every time I see someone use $#array just to be able to use foreach to loop over the indices and avoid using a C-style loop, I die a little.
Re: selecting N random lines from a file in one pass
by itub (Priest) on Dec 22, 2004 at 18:28 UTC
    Thanks, you must have read my mind. I was just thinking about this same problem yesterday!

    There is one potential problem, depending on what one wants. If the file has fewer lines than the sample size, you can get a sample with "holes":

    $ ./select.pl 3
    1
    2
    [0] '2'
    [1] ''
    [2] '1'
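    If that matters for your use, one small guard is to print only the slots that were actually filled. A minimal, self-contained sketch (the hard-coded @sample just imitates what a 3-slot sample of a 2-line file might hold):

    use strict;
    use warnings;

    my @sample = ('2', undef, '1');          # a 3-slot sample of a 2-line file
    my @filled = grep { defined } @sample;   # drop the unfilled slots
    print "[$_] '$filled[$_]'\n" for 0 .. $#filled;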
Re: selecting N random lines from a file in one pass
by xdg (Monsignor) on Dec 22, 2004 at 23:33 UTC

    If the focus is efficiency, why are you chomping every line when you only want a few? Moving the chomp inside the if will speed up your code by about 10% - 12% (based on a quick benchmark run I did).
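    Concretely, the change looks like this (a sketch of the idea, not necessarily the exact code that was benchmarked; the declarations are copied from the original post):

    use strict;
    my $size  = shift @ARGV || 3;   # sample size
    my @sample;
    my $taken = 0;

    while ( my $line = <> ) {
        if ( rand(1) < ($size/$.) ) {
            chomp $line;            # chomp only the lines that enter the sample
            my $position;
            do {
                $position = int rand($size);
            } while ( $taken < $size && $sample[$position] );
            $sample[$position] = $line;
            $taken++;
        }
    }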

    -xdg


Re: selecting N random lines from a file in one pass
by boo_radley (Parson) on Dec 22, 2004 at 17:02 UTC
      Your self-promotion requires reading the entire file into memory before it can be passed to the module, which makes it kind of inapplicable to the problem.

      However, if you do have the entire data set in memory, there's a simpler approach than the math theory you use in your module. Rather than iterating over the array, consider just drawing elements at random; if you draw a duplicate, redraw. The larger the set, the lower the odds of a duplicate. This scales a lot better than your module, though I admit it does more redrawing when the set size is close to the sample size.

      It's a lot like how, in gaming, you can fake a d12 by rolling a d20 until you've got a number in range.

      use strict;
      use Data::Dumper;

      my @set  = ('a'..'z');
      my $size = 3;
      my @sample;
      my %seen;

      while( @sample < $size ){
          my $elt;
          do{
              $elt = int rand(@set);
          }while( $seen{$elt} );
          push @sample, $set[$elt];
          $seen{$elt}++;
      }

      print Dumper \@sample;

      This example should really have a check, too, that the sample size doesn't exceed the size of the set, to avoid looping forever.
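      That guard could be as simple as the line below, dropped in right after @set and $size are declared (a sketch, assuming the variable names from the snippet above):

      # refuse to sample more elements than the set contains
      die sprintf("sample size (%d) is larger than the set (%d)\n", $size, scalar @set)
          if $size > @set;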