The problem with the Knuth method is that it requires you to store 20 million lines in memory, which will take pretty much the full 2GB maximum memory generally available if your lines average a modest 50 chars in length. If they average longer, or your file is bigger, you've blown it.
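To make the memory cost concrete, here is a minimal sketch of reservoir sampling (Algorithm R). I am assuming that is what "the Knuth method" refers to here; the sub name `reservoir_sample` and the iterator interface are made up for illustration. Note that the reservoir holds K whole lines, which is where the 20 million stored lines come from.

```perl
use strict;
use warnings;

## Reservoir sampling (Algorithm R): keep $k items chosen uniformly
## from a stream of unknown length. The reservoir holds $k whole lines.
sub reservoir_sample {
    my( $k, $next ) = @_;    ## $next: code ref returning the next line, or undef at the end
    my @reservoir;
    my $seen = 0;
    while( defined( my $line = $next->() ) ) {
        ++$seen;
        if( @reservoir < $k ) {
            ## Fill the reservoir with the first $k lines
            push @reservoir, $line;
        }
        else {
            ## Replace a random slot with probability $k / $seen
            my $j = int rand $seen;    ## 0 .. $seen - 1
            $reservoir[ $j ] = $line if $j < $k;
        }
    }
    return \@reservoir;
}

my @lines  = map "line $_", 1 .. 10;
my $picked = reservoir_sample( 3, sub { shift @lines } );
print scalar( @$picked ), " lines kept\n";
```

Every kept line costs its own length plus Perl's per-scalar overhead, so the footprint grows with both K and the average line length.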
This code needs, in total, <320MB and ~2.5 minutes on my machine:
#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];

## -s lets -N=... and -K=... on the command line override these defaults
our $N ||= 40e6;
our $K ||= $N / 2;

## In-place FY shuffle of a packed ULONG array
sub shuffle {
    my $ref = shift;
    my $n = length( $$ref ) >> 2;
    for( 0 .. $n - 1 ) {
        ## Pick a partner from the not-yet-fixed tail, $_ .. $n-1
        my $p = $_ + int( rand( $n - $_ ) );
        my $tmp = substr $$ref, $_*4, 4;
        substr $$ref, $_*4, 4, substr $$ref, $p*4, 4;
        substr $$ref, $p*4, 4, $tmp;
    }
    return;
}

## Allocate a packed array of $N ULONGs
my $index = \ ( chr(0) x ( 4 * $N ) );

## Populate it
substr $$index, $_*4, 4, pack 'V', $_ for 0 .. $N - 1;

## Shuffle in place
shuffle $index;

## Allocate a bit-vector for linear sort
my $ordered = chr(0) x int( ( $N + 7 ) / 8 );

## Linear sort: set the bit for each of the first $K shuffled line numbers
vec( $ordered, unpack( 'V', substr $$index, $_*4, 4 ), 1 ) = 1 for 0 .. $K - 1;

## Read the file line by line
while( <> ) {
    chomp;    ## -l appends a newline on print
    ## print it if it was picked ($. counts from 1, the index from 0)
    print if vec( $ordered, $. - 1, 1 );
}
The nice thing is that it will comfortably extend linearly to handle selecting any number of lines from files of up to 250 million lines within a 2GB process.
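The arithmetic behind those figures can be sketched as below. The sub names are made up for illustration, and the "doubling" model (the packed index briefly existing twice while it is initialised) is my reading of the numbers quoted here, not a measurement.

```perl
use strict;
use warnings;

## Bytes for the packed index: one 4-byte ULONG per line
sub index_bytes  { my $n = shift; return 4 * $n }

## Bytes for the bit-vector: one bit per line, rounded up to whole bytes
sub bitvec_bytes { my $n = shift; return int( ( $n + 7 ) / 8 ) }

## Assumed peak: the index exists twice while the scalar is initialised
sub peak_bytes   { my $n = shift; return 2 * index_bytes( $n ) }

## Footprint once initialisation is done: index plus bit-vector
sub steady_bytes { my $n = shift; return index_bytes( $n ) + bitvec_bytes( $n ) }

printf "%4.0fM lines: assumed peak %7.1f MB, steady state %7.1f MB\n",
    $_ / 1e6, peak_bytes( $_ ) / 2**20, steady_bytes( $_ ) / 2**20
    for 40e6, 250e6;
```

Under that model the cost depends only on the number of lines, never on their length, which is why it scales linearly.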
With a little hackery to avoid Perl's memory-doubling scalar initialisation, it could be extended to handle any number from a 500 million line file.
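One possible form of that hackery, as a sketch only: build the packed index by appending modest chunks, so that no single temporary the size of the whole index is created and then copied. The sub name `build_index` and the chunk size are my own assumptions, and whether this really avoids the doubling depends on the Perl build and its allocator.

```perl
use strict;
use warnings;

## Hypothetical helper: build the packed index of 0 .. $n-1 by appending
## fixed-size chunks, rather than allocating chr(0) x ( 4 * $n ) and copying it.
sub build_index {
    my( $n, $chunk ) = @_;
    $chunk ||= 1 << 16;    ## entries per append; an arbitrary choice
    my $index = '';
    for( my $start = 0; $start < $n; $start += $chunk ) {
        my $end = $start + $chunk - 1;
        $end = $n - 1 if $end > $n - 1;
        ## Append this run of line numbers as little-endian ULONGs
        $index .= pack 'V*', $start .. $end;
    }
    return \$index;
}

my $ref = build_index( 10, 4 );
print join( ' ', unpack 'V*', $$ref ), "\n";
```

This also populates the index as it goes, so the separate populate pass would no longer be needed.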
In reply to Re: A series of random number and others by BrowserUk, in thread A series of random number and others by lightoverhead