The problem with the Knuth method is that it requires you to store 20 million lines in memory, which will take pretty much the full 2GB maximum memory generally available even if your lines average a modest 50 chars in length. If they average longer, or your file is bigger, you've blown it.

This code:

1. Generates a packed list of 40e6 integers (315MB; 40e6*4*2 because of the way Perl initialises scalars).
2. F-Y shuffles that list in-place (+0MB). 2 minutes.
3. "Sorts" the first 20e6 of the shuffled integers by building a bit-vector (+5MB). 30 seconds.
4. Reads the file and prints a line if its bit is set in the bit-vector.

In total, <320MB and ~2.5 minutes on my machine.

#! perl -slw
use strict;
use Math::Random::MT qw[ rand ];

our $N ||= 40e6;
our $K ||= $N / 2;

## In-place FY shuffle of a packed ULONG array
sub shuffle {
    my $ref = shift;
    my $n = length( $$ref ) >> 2;
    ## Valid slots are 0 .. $n-1; pick $p uniformly from $_ .. $n-1
    for( 0 .. $n - 1 ) {
        my $p = $_ + int( rand( $n - $_ ) );
        my $tmp = substr $$ref, $_*4, 4;
        substr $$ref, $_*4, 4, substr $$ref, $p*4, 4;
        substr $$ref, $p*4, 4, $tmp;
    }
    return;
}

## Allocate a packed array of $N Ulongs
my $index = \ (chr(0) x ( 4 * $N ));

## Populate it
substr $$index, $_*4, 4, pack 'V', $_  for 0 .. $N - 1;

## Shuffle in place
shuffle $index;

## Allocate a bit-vector for linear sort
my $ordered = chr(0) x int( ( $N +7 ) / 8 );

## Linear sort
vec( $ordered, unpack( 'V', substr $$index, $_*4, 4 ), 1 ) = 1
    for 0 .. $K - 1;

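## Read the file line by line
while( <> ) {
    ## -l appends a newline on print but only auto-chomps under -n/-p
    chomp;
    ## print it if it was picked; $. is 1-based, the indexes are 0-based
    print if vec( $ordered, $. - 1, 1 );
}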
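To see the shuffle-then-bit-vector idea at a size you can eyeball, here is a scaled-down sketch. It is not the code above: `pick_indexes` is a hypothetical helper name, it uses Perl's built-in `rand` rather than the Mersenne Twister module so it runs without CPAN, and it fills the packed array with a single `pack` call, which is fine for small lists but builds a temporary list you would not want at 40e6.

```perl
use strict;
use warnings;

# Hypothetical helper: pick $k of $n zero-based line indexes and
# return them in ascending order. Built-in rand is an assumption
# made to keep the sketch dependency-free.
sub pick_indexes {
    my( $n, $k ) = @_;

    # Packed array of little-endian 32-bit integers 0 .. $n-1
    my $packed = pack 'V*', 0 .. $n - 1;

    # Fisher-Yates shuffle, swapping 4-byte slots in place
    for my $i ( 0 .. $n - 2 ) {
        my $p   = $i + int rand( $n - $i );
        my $tmp = substr $packed, $i*4, 4;
        substr $packed, $i*4, 4, substr( $packed, $p*4, 4 );
        substr $packed, $p*4, 4, $tmp;
    }

    # "Sort" the first $k picks by setting one bit per picked index
    my $bits = chr(0) x int( ( $n + 7 ) / 8 );
    vec( $bits, unpack( 'V', substr( $packed, $_*4, 4 ) ), 1 ) = 1
        for 0 .. $k - 1;

    # Scanning the bit-vector yields the picks in ascending order
    return grep { vec( $bits, $_, 1 ) } 0 .. $n - 1;
}

my @picked = pick_indexes( 20, 5 );
print "@picked\n";
```

Each run prints five distinct indexes between 0 and 19, already in ascending order, which is why no comparison sort is needed before reading the file.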

The nice thing is that it will comfortably extend linearly to handle selecting any number of lines from files of up to 250 million lines within a 2GB process.

With a little hackery to avoid Perl's memory-doubling scalar initialisation, it could be extended to handle any number from a 500 million line file.
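For anyone wanting to check those limits, the arithmetic can be sketched as below. The 2GB figure is taken as 2 * 1024**3 bytes, which is my assumption about what "2GB process" means here, and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Assumed process limit: 2GB expressed in bytes
my $limit = 2 * 1024**3;

# 4 bytes per packed index, doubled while the scalar is initialised
my $with_doubling = int( $limit / ( 4 * 2 ) );

# 4 bytes per packed index if the temporary copy is avoided
my $without_doubling = int( $limit / 4 );

printf "%d lines with doubling, %d without\n",
    $with_doubling, $without_doubling;
```

That gives about 268 million and 537 million indexes respectively, before allowing for the bit-vector (one bit per line) and the interpreter itself, which is consistent with the rounder 250 and 500 million figures.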

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

In reply to Re: A series of random number and others by BrowserUk
in thread A series of random number and others by lightoverhead
