Re: selecting N random lines from a file in one pass
by BrowserUk (Patriarch) on Dec 22, 2004 at 18:29 UTC
Re: Approach to efficiently choose a random line from a large file is a great reference! (++BrowserUk) Late last night I thought about the problem again and came up with something independently that was almost verbatim the referenced solution. If you're willing to live with something not truly random (biased slightly by line size), then just randomly seeking into the file and pulling out a line N times is far, far faster than reading all the lines. The bias is worst when your lines vary widely in length. But if it just needs to look like random lines, then this is hands down the winner:
               s/iter original late_chomp rand_seek
original         4.83       --        -9%     -100%
late_chomp       4.39      10%         --     -100%
rand_seek    1.20e-03  402667%    365992%        --
This is based on pulling 10 lines from the 123 MB cpan-gets file (from Help update the Phalanx 100). My code is similar to Re: Approach to efficiently choose a random line from a large file, though rather than searching backwards from a random position, I read and discard the line fragment I jumped to and then take the next line. (Which does mean the first line can't be picked, but just insert a blank line at the beginning.)
# assumes a global "$n" and "$file"
sub rand_seek {
my $size = $n || 3; # sample size
my (@sample, @seen, $line, $pos_after);
my $filesize = -s $file;
open(my $fh, $file) or die;
while (@sample < $size) {
seek($fh,int(rand($filesize)),0);
<$fh>; # skip this fragment of a line
$line = <$fh>; # get the next line
$pos_after = tell($fh);
next if grep { $pos_after == $_ } @seen;
chomp $line;
push @sample, $line if $line;
push @seen, $pos_after;
}
close $fh
}
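A hypothetical usage, setting up the globals the sub expects (the file name is just the test file mentioned above):

our $n    = 10;           # sample size
our $file = 'cpan-gets';  # the 123 MB test file
my @lines = rand_seek();
print "$_\n" for @lines;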
-xdg
(Which does mean the first line can't be picked, but just insert a blank line at the beginning.)
And it also means that if your random seek takes you into the last line of the file, you get nothing back at the end of the file. So, in that case, you could kill both birds by taking the first line of the file - which happens to be very quickly findable. ;-)
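A sketch of that combined fix inside rand_seek above: if the read after the seek comes back undefined, the seek landed in the last line, so fall back to line 1 (which the original loop could never pick):

seek($fh, int(rand($filesize)), 0);
<$fh>;                    # skip the fragment we landed in
$line = <$fh>;            # get the next line
if (!defined $line) {     # we seeked into the last line...
    seek($fh, 0, 0);      # ...so rewind and take line 1,
    $line = <$fh>;        # killing both birds at once
}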
Re: selecting N random lines from a file in one pass
by tilly (Archbishop) on Dec 23, 2004 at 02:02 UTC
If you really want efficiency, have multiple loops rather than a single loop with an if test. Like this:
use strict;

my $size = shift @ARGV || 3; # sample size
my @sample;

# read n lines
for (1 .. $size) {
    push @sample, scalar <>;
}

# shuffle - I'll do it in pure perl.
my $i = $size;
while ($i--) {
    my $j = int rand($i + 1);
    @sample[$i, $j] = @sample[$j, $i];
}

# now read the rest of the lines in.
while (<>) {
    my $choice = int rand($.);
    if ($choice < $size) {
        $sample[$choice] = $_;
    }
}

# and minimize work by chomping as few lines as possible.
chomp(@sample);
This can be beaten. But not by much...
Update: Per the Anonymous Monk's note below, fixed the off-by-one error in the shuffle (the swap index must be int rand($i + 1), not int rand($i)).
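For the record, the invariant that makes the one-pass approach work: when line number $. is read, it overwrites a random slot with probability $size/$., which keeps every line seen so far equally likely to occupy a slot in @sample. A hypothetical usage, assuming the code above is saved as sample.pl:

perl sample.pl 5 < cpan-gets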
# shuffle - I'll do it in pure perl.
Too bad your implementation is flawed:
use strict;
use warnings;

my $size  = 4;
my @array = 1 .. $size;
my %count;

foreach (1 .. 24_000) {
    my @sample = @array;
    my $i = $size;
    while ($i--) {
        my $j = int rand($i);    # the flaw: should be int rand($i + 1)
        @sample[$i, $j] = @sample[$j, $i];
    }
    $count{"@sample"}++;
}

foreach my $key (sort keys %count) {
    printf "%s: %4d\n", $key, $count{$key};
}
__END__
2 3 4 1: 3999
2 4 1 3: 4029
3 1 4 2: 3969
3 4 2 1: 4000
4 1 2 3: 4031
4 3 1 2: 3972
Instead of getting 24 different permutations, each appearing about 1,000 times, your algorithm produces only 6 permutations, each appearing about 4,000 times.
Better to use a module next time!
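One such module is List::Util, which has shipped with the Perl core since 5.8 and provides a correct Fisher-Yates shuffle. A minimal sketch of the fix:

use strict;
use warnings;
use List::Util qw(shuffle);

my @sample   = 1 .. 4;
my @shuffled = shuffle @sample;  # unbiased: all 24 permutations equally likely
print "@shuffled\n";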
Re: selecting N random lines from a file in one pass
by Anonymous Monk on Dec 22, 2004 at 17:34 UTC
Off-topic, but I can't resist. Every time I see a C-style for loop in someone's Perl code, I die a little inside. How about this instead:
for my $i ( 0 .. $#sample ) {
    print "[$i] '$sample[$i]'\n";
}
Old habits... =)
Anyone know if that's actually faster or slower in execution? What about memory consumption?
Don't worry about it. If there are any differences, they are so small you would never see them outside of a crazy benchmark.
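For anyone who wants to run that crazy benchmark anyway, a minimal sketch using the core Benchmark module (the loop bodies are stand-ins, not code from this thread):

use strict;
use warnings;
use Benchmark qw(cmpthese);

my @sample = ('some line of text') x 10_000;

cmpthese(-2, {
    c_style => sub {
        my $total = 0;
        for (my $i = 0; $i <= $#sample; $i++) { $total += length $sample[$i] }
    },
    range => sub {
        my $total = 0;
        for my $i (0 .. $#sample) { $total += length $sample[$i] }
    },
});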
Every time I see someone use $#array just to be able to use foreach to loop over the indices and avoid using a C-style loop, I die a little.
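If the index is only ever used to subscript the same array, one way to sidestep the debate entirely (a sketch; not necessarily what this poster had in mind) is to iterate the values directly:

for my $line (@sample) {
    print "'$line'\n";
}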
Re: selecting N random lines from a file in one pass
by itub (Priest) on Dec 22, 2004 at 18:28 UTC
Thanks, you must have read my mind. I was just thinking about this same problem yesterday!
There is one potential problem, depending on what one wants. If the file has fewer lines than the sample size, you can get a sample with "holes":
$ ./select.pl 3
1
2
[0] '2'
[1] ''
[2] '1'
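One possible guard (a sketch against tilly's reading loop above, not the original poster's code) is to stop filling the sample when the input runs out, so the sample shrinks rather than growing holes:

my @sample;
while (@sample < $size) {
    my $line = <>;
    last unless defined $line;  # fewer lines than $size: stop early
    push @sample, $line;
}
# @sample now holds min($size, line count) entries - no holes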
Re: selecting N random lines from a file in one pass
by xdg (Monsignor) on Dec 22, 2004 at 23:33 UTC
If the focus is efficiency, why are you chomping every line when you only want a few? Moving the chomp inside the if will speed up your code by about 10% - 12% (based on a quick benchmark run I did).
-xdg
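A sketch of that change, assuming the original loop chomped every line as it was read (variable names borrowed from tilly's version above):

while (<>) {
    my $choice = int rand($.);
    if ($choice < $size) {
        chomp;  # only chomp the lines we actually keep
        $sample[$choice] = $_;
    }
}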
Re: selecting N random lines from a file in one pass
by boo_radley (Parson) on Dec 22, 2004 at 17:02 UTC
Your self-promotion requires reading the entire file into memory before it can be passed to the module, which makes it kind of inapplicable to the problem.
However, if you do have the entire data set in memory, there's a simpler approach than the math theory you use in your module. Rather than iterate the array, consider just drawing at random and redrawing whenever you hit a duplicate. The larger the set, the lower the odds of a duplicate. This scales a lot better than your module, though I admit it does bog down when the sample size is close to the size of the set.
It's a lot like how, in gaming, you can fake a d12 by rolling a d20 until you've got a number in range.
use strict;
use warnings;
use Data::Dumper;

my @set  = ('a' .. 'z');
my $size = 3;

# without this check, asking for more elements than the set
# holds would loop forever below
die "sample size exceeds set size\n" if $size > @set;

my @sample;
my %seen;
while (@sample < $size) {
    my $elt;
    do {
        $elt = int rand(@set);    # draw at random...
    } while ($seen{$elt});        # ...redraw if it's a duplicate
    push @sample, $set[$elt];
    $seen{$elt}++;
}
print Dumper \@sample;
Note the up-front check on the sample size: without it, asking for more elements than the set contains would loop forever.