in reply to selecting N random lines from a file in one pass

If you really want efficiency, use multiple loops rather than a single loop with an if test. Like this:
use strict;

my $size = shift @ARGV || 3;    # sample size
my @sample;

# read n lines
for (1 .. $size) {
    push @sample, scalar <>;
}

# shuffle - I'll do it in pure perl.
my $i = $size;
while ($i--) {
    my $j = int rand($i + 1);
    @sample[$i, $j] = @sample[$j, $i];
}

# now read the rest of the lines in.
while (<>) {
    my $choice = int rand($.);
    if ($choice < $size) {
        $sample[$choice] = $_;
    }
}

# and minimize work by chomping as few lines as possible.
chomp(@sample);
This can be beaten. But not by much...

Update: Fixed an off-by-one error in the shuffle, per the note by the Anonymous Monk below.
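
For the curious, here is a throw-away harness (my own, not part of the original post) that empirically checks the replacement rule above: each of the first $. lines should end up in the sample with equal probability, so every input line should be selected about size/total of the time.

```perl
use strict;
use warnings;

# Sample 3 of the numbers 1..10 with the same replacement rule as
# above, 20,000 times, and count how often each number is picked.
# Uniform sampling should pick each number about
# 20_000 * 3/10 = 6,000 times.
my $size  = 3;
my @lines = (1 .. 10);
my %seen;

for my $trial (1 .. 20_000) {
    my @sample;
    my $n = 0;
    for my $line (@lines) {
        $n++;                            # $n plays the role of $.
        if ($n <= $size) {
            push @sample, $line;         # fill the reservoir
        }
        else {
            my $choice = int rand($n);   # 0 .. $n-1, so P(hit) = $size/$n
            $sample[$choice] = $line if $choice < $size;
        }
    }
    $seen{$_}++ for @sample;
}

printf "%2d: %5d\n", $_, $seen{$_} for @lines;
```

The counts wobble, but no line is favored: that is the whole point of dividing the keep-probability by the running line count.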

Replies are listed 'Best First'.
Re^2: selecting N random lines from a file in one pass
by Anonymous Monk on Dec 23, 2004 at 13:04 UTC
    # shuffle - I'll do it in pure perl.
    Too bad your implementation is flawed:
    use strict;
    use warnings;

    my $size  = 4;
    my @array = 1 .. $size;
    my %count;

    foreach (1 .. 24_000) {
        my @sample = @array;
        my $i = $size;
        while ($i--) {
            my $j = int rand($i);    # bug: never picks $j == $i
            @sample[$i, $j] = @sample[$j, $i];
        }
        $count{"@sample"}++;
    }

    foreach my $key (sort keys %count) {
        printf "%s: %4d\n", $key, $count{$key};
    }

    __END__
    2 3 4 1: 3999
    2 4 1 3: 4029
    3 1 4 2: 3969
    3 4 2 1: 4000
    4 1 2 3: 4031
    4 3 1 2: 3972
    Instead of getting 24 different permutations, each appearing about 1,000 times, your algorithm produces only 6 permutations, each appearing about 4,000 times. Because int rand($i) can never choose $j == $i, no element is ever left in place on its own pass; that variant is Sattolo's algorithm, which generates only the (4-1)! = 6 cyclic permutations.

    Better use a module next time!
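
    For comparison, the same counting test with the off-by-one fixed: letting $j range over 0..$i inclusive turns the loop into the standard Fisher-Yates shuffle, which produces all 24 permutations.

```perl
use strict;
use warnings;

# Same counting test as in the reply above, but with
# int rand($i + 1) so that $j may equal $i (Fisher-Yates).
my $size  = 4;
my @array = 1 .. $size;
my %count;

foreach (1 .. 24_000) {
    my @sample = @array;
    my $i = $size;
    while ($i--) {
        my $j = int rand($i + 1);    # 0 .. $i inclusive
        @sample[$i, $j] = @sample[$j, $i];
    }
    $count{"@sample"}++;
}

# All 4! = 24 permutations should appear, each about 1,000 times.
printf "%d distinct permutations\n", scalar keys %count;
```

    In production code, List::Util's shuffle (in the core distribution since Perl 5.8) does the same job and spares you the off-by-one risk entirely.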