in reply to Re: selecting N random lines from a file in one pass
in thread selecting N random lines from a file in one pass
Re: Approach to efficiently choose a random line from a large file is a great reference! (++BrowserUk) Late last night I thought about the problem again and come up with something independently that was almost verbatim the referenced solution. If you're willing to live with something not truly random (biased slightly by line size), then just randomly seeking into the file and pulling out a line N times is far, far faster than reading all the lines. It's most biased if your lines are very different in length. But if it just needs to look like random lines, then this is hands down the winner:
s/iter original late_chomp rand_seek original 4.83 -- -9% -100% late_chomp 4.39 10% -- -100% rand_seek 1.20e-03 402667% 365992% --
This is based on pulling 10 lines from the 123 MB cpan-gets file (from Help update the Phalanx 100). My code is similar to Re: Approach to efficiently choose a random line from a large file though rather than search backwards from a random position, I read and discard the line fragment I jumped to and then take the next line.(Which does mean the first line can't be picked, but just insert a blank line at the beginning.)
# assumes a global "$n" and "$file" sub rand_seek { my $size = $n || 3; # sample size my (@sample, @seen, $line, $pos_after); my $filesize = -s $file; open(my $fh, $file) or die; while (@sample < $size) { seek($fh,int(rand($filesize)),0); <$fh>; # skip this fragment of a line $line = <$fh>; # get the next line $pos_after = tell($fh); next if grep { $pos_after == $_ } @seen; chomp $line; push @sample, $line if $line; push @seen, $pos_after; } close $fh }
-xdg
Code posted by xdg on PerlMonks is public domain. It has no warranties, express or implied. Posted code may not have been tested. Use at your own risk.
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re^3: selecting N random lines from a file in one pass
by jdporter (Paladin) on Dec 23, 2004 at 20:51 UTC |