in reply to A series of random number and others

The problem with Knuth's method is that it requires you to store 20 million lines in memory, which will take pretty much the full 2GB maximum memory generally available to a process, even if your lines average a modest 50 chars in length. If they average longer, or your file is bigger, you've blown it.

This code:

  1. Generates a packed list of 40e6 integers (315MB: 40e6 * 4 bytes * 2, the doubling being down to the way Perl initialises scalars).
  2. Fisher-Yates (F-Y) shuffles that list in place (+0MB); 2 minutes.
  3. "Sorts" the first 20e6 of the shuffled integers by building a bit-vector (+5MB); 30 seconds.
  4. Reads the file and prints a line if its bit is set in the bit-vector.

In total, <320MB and ~2.5 minutes on my machine.

#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];

our $N ||= 40e6;
our $K ||= $N / 2;

## In-place FY shuffle of a packed ULONG array
sub shuffle {
    my $ref = shift;
    my $n = length( $$ref ) >> 2;
    for( 0 .. $n - 1 ) {
        ## Pick a swap partner from the not-yet-fixed tail
        my $p = $_ + int( rand( $n - $_ ) );
        my $tmp = substr $$ref, $_*4, 4;
        substr $$ref, $_*4, 4, substr $$ref, $p*4, 4;
        substr $$ref, $p*4, 4, $tmp;
    }
    return;
}

## Allocate a packed array of $N Ulongs
my $index = \ (chr(0) x ( 4 * $N ));

## Populate it
substr $$index, $_*4, 4, pack 'V', $_ for 0 .. $N - 1;

## Shuffle in place
shuffle $index;

## Allocate a bit-vector for linear sort
my $ordered = chr(0) x int( ( $N + 7 ) / 8 );

## Linear sort: set the bit for each of the first $K shuffled indices
vec( $ordered, unpack( 'V', substr $$index, $_*4, 4 ), 1 ) = 1 for 0 .. $K - 1;

## Read the file line by line
while( <> ) {
    ## print it if it was picked ($. counts from 1, the indices from 0)
    print if vec( $ordered, $. - 1, 1 );
}
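
To see why setting bits amounts to a linear-time "sort", here is the same idea in miniature. This is only a sketch using core Perl; bitvector_sort is a name made up for the illustration and is not part of the script above, and it assumes the values are distinct non-negative integers no larger than $max.

```perl
use strict;
use warnings;

## Return the given distinct non-negative integers in ascending order
## by setting one bit per value and then scanning the bit-vector.
## (bitvector_sort is a hypothetical helper, for illustration only.)
sub bitvector_sort {
    my( $max, @picked ) = @_;
    my $bits = chr(0) x int( ( $max + 8 ) / 8 );   ## room for bits 0 .. $max
    vec( $bits, $_, 1 ) = 1 for @picked;           ## O(1) per value
    return grep { vec( $bits, $_, 1 ) } 0 .. $max; ## one pass, already in order
}

my @sorted = bitvector_sort( 9, 7, 2, 5 );
print "@sorted\n";   ## prints "2 5 7"
```

The final scan in the script is the same pass, except that it walks the line numbers of the file instead of 0 .. $max.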

The nice thing is that it will comfortably extend linearly to handle selecting any number of lines from files of up to 250 million lines within a 2GB process.

With a little hackery to avoid Perl's memory-doubling scalar initialisation, it could be extended to handle any number from a 500 million line file.
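
The arithmetic behind those limits can be checked with a rough estimate. This is a back-of-envelope sketch only: estimate_bytes is a hypothetical helper, and it assumes the peak is simply the packed index (4 bytes per line, times 2 while the doubling applies) plus one bit per line for the bit-vector, ignoring the interpreter's own overhead.

```perl
use strict;
use warnings;

## Rough peak size in bytes for an $n-line file.
## $doubling is 2 with the scalar-initialisation copy, 1 if it is avoided.
## (estimate_bytes is a hypothetical helper, not part of the script.)
sub estimate_bytes {
    my( $n, $doubling ) = @_;
    my $index  = $n * 4 * $doubling;      ## packed 4-byte integers
    my $bitvec = int( ( $n + 7 ) / 8 );   ## one bit per line
    return $index + $bitvec;
}

## Compare each case against a 2GB (2**31 byte) process limit
for my $case ( [ 40e6, 2 ], [ 250e6, 2 ], [ 500e6, 1 ] ) {
    my( $n, $doubling ) = @$case;
    my $bytes = estimate_bytes( $n, $doubling );
    printf "%d million lines (x%d): %.0f MiB, %s 2GB\n",
        $n / 1e6, $doubling, $bytes / 2**20,
        $bytes < 2**31 ? 'under' : 'over';
}
```

On these assumptions all three cases come in under 2**31 bytes, though the 250 million and 500 million cases leave little headroom.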

