in reply to selecting N random lines from a file in one pass

If you really want efficiency, use multiple loops rather than a single loop with an if test. Like this:
use strict;

my $size = shift @ARGV || 3;    # sample size
my @sample;

# read n lines
for (1 .. $size) {
    push @sample, scalar <>;
}

# shuffle - I'll do it in pure perl.
my $i = $size;
while ($i--) {
    my $j = int rand($i + 1);
    @sample[$i, $j] = @sample[$j, $i];
}

# now read the rest of the lines in.
while (<>) {
    my $choice = int rand($.);
    if ($choice < $size) {
        $sample[$choice] = $_;
    }
}

# and minimize work by chomping as few lines as possible.
chomp(@sample);
This can be beaten. But not by much...

Update: Fixed an off-by-one error in the shuffle, per the note by the Anonymous Monk below.
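
For the curious, here is a throw-away harness (my own, not part of the original post) that empirically checks the replacement rule above: each of the first $. lines should end up in the sample with equal probability, so every input line should be selected about size/total of the time.

```perl
use strict;
use warnings;

# Sample 3 of the numbers 1..10 with the same replacement rule as
# above, 20,000 times, and count how often each number is picked.
# Uniform sampling should pick each number about
# 20_000 * 3/10 = 6,000 times.
my $size  = 3;
my @lines = (1 .. 10);
my %seen;

for my $trial (1 .. 20_000) {
    my @sample;
    my $n = 0;
    for my $line (@lines) {
        $n++;                            # $n plays the role of $.
        if ($n <= $size) {
            push @sample, $line;         # fill the reservoir
        }
        else {
            my $choice = int rand($n);   # 0 .. $n-1, so P(hit) = $size/$n
            $sample[$choice] = $line if $choice < $size;
        }
    }
    $seen{$_}++ for @sample;
}

printf "%2d: %5d\n", $_, $seen{$_} for @lines;
```

The counts wobble, but no line is favored: that is the whole point of dividing the keep-probability by the running line count.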

Replies are listed 'Best First'.
Re^2: selecting N random lines from a file in one pass
by Anonymous Monk on Dec 23, 2004 at 13:04 UTC
    # shuffle - I'll do it in pure perl.
    Too bad your implementation is flawed:
    use strict;
    use warnings;

    my $size  = 4;
    my @array = 1 .. $size;
    my %count;

    foreach (1 .. 24_000) {
        my @sample = @array;
        my $i = $size;
        while ($i--) {
            my $j = int rand($i);    # bug: never picks $j == $i
            @sample[$i, $j] = @sample[$j, $i];
        }
        $count{"@sample"}++;
    }

    foreach my $key (sort keys %count) {
        printf "%s: %4d\n", $key, $count{$key};
    }

    __END__
    2 3 4 1: 3999
    2 4 1 3: 4029
    3 1 4 2: 3969
    3 4 2 1: 4000
    4 1 2 3: 4031
    4 3 1 2: 3972
    Instead of getting 24 different permutations, each appearing about 1,000 times, your algorithm produces only 6 permutations, each appearing about 4,000 times. Because int rand($i) can never choose $j == $i, no element is ever left in place on its own pass; that variant is Sattolo's algorithm, which generates only the (4-1)! = 6 cyclic permutations.

    Better use a module next time!
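
    For comparison, the same counting test with the off-by-one fixed: letting $j range over 0..$i inclusive turns the loop into the standard Fisher-Yates shuffle, which produces all 24 permutations.

```perl
use strict;
use warnings;

# Same counting test as in the reply above, but with
# int rand($i + 1) so that $j may equal $i (Fisher-Yates).
my $size  = 4;
my @array = 1 .. $size;
my %count;

foreach (1 .. 24_000) {
    my @sample = @array;
    my $i = $size;
    while ($i--) {
        my $j = int rand($i + 1);    # 0 .. $i inclusive
        @sample[$i, $j] = @sample[$j, $i];
    }
    $count{"@sample"}++;
}

# All 4! = 24 permutations should appear, each about 1,000 times.
printf "%d distinct permutations\n", scalar keys %count;
```

    In production code, List::Util's shuffle (in the core distribution since Perl 5.8) does the same job and spares you the off-by-one risk entirely.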