in reply to Randomizing Big Files

This should be about as efficient as you will get. It works in two passes: the first finds the start-of-line offsets, then a long loop of seek-and-read jumps directly to those offsets, visiting the lines in random order.

Update: code changed per Anonymonk's hint. Old code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw( ceil );

    my $basepos = 0;

    open my $fh, '<', $ARGV[ 0 ]
        or die "Couldn't open $ARGV[ 0 ] for reading: $!\n";

    # first pass: collect the start offset of every line as packed longs
    my $offs = pack 'L', 0;
    while ( my $advance = read $fh, $_, 128 * 1024 ) {
        while ( /\n/g ) {
            # pos() is just past the newline, i.e. at the next line's start
            $offs .= pack 'L', $basepos + pos();
        }
        $basepos += $advance;
    }

    # we will be looking for "start of following line", so append a bogus
    # entry to obviate need for a special case at the last line (if the
    # file ends in a newline, the entry is already there)
    $offs .= pack 'L', $basepos
        if unpack( 'L', substr $offs, -4 ) != $basepos;

    my $total_lines = length( $offs ) / 4 - 1;
    my $been_here = "\0" x ceil( $total_lines / 8 );

    # second pass: seek to the lines in random order and copy them out
    my $i = 0;
    while ( $i < $total_lines ) {
        my $line = int rand $total_lines;
        next if vec $been_here, $line, 1;
        my ( $start, $end ) = unpack 'x' . ( 4 * $line ) . ' LL', $offs;
        seek $fh, $start, 0;
        read $fh, $_, $end - $start;
        print;
        ++$i;
        vec( $been_here, $line, 1 ) = 1;
    }

New code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw( ceil );

    my $basepos = 0;

    open my $fh, '<', $ARGV[ 0 ]
        or die "Couldn't open $ARGV[ 0 ] for reading: $!\n";

    # first pass: collect the start offset of every line in $offs,
    # used as a packed array of 32-bit unsigned integers via vec()
    my $offs = '';
    vec( $offs, 0, 32 ) = 0;
    my $entries = 1;
    while ( my $advance = read $fh, $_, 128 * 1024 ) {
        while ( /\n/g ) {
            # pos() is just past the newline, i.e. at the next line's start
            vec( $offs, $entries++, 32 ) = $basepos + pos();
        }
        $basepos += $advance;
    }

    # we will be looking for "start of following line", so append a bogus
    # entry to obviate need for a special case at the last line (if the
    # file ends in a newline, the entry is already there)
    if ( vec( $offs, $entries - 1, 32 ) != $basepos ) {
        vec( $offs, $entries++, 32 ) = $basepos;
    }

    my $total_lines = $entries - 1;
    my $been_here = "\0" x ceil( $total_lines / 8 );

    # second pass: seek to the lines in random order and copy them out
    my $i = 0;
    while ( $i < $total_lines ) {
        my $line = int rand $total_lines;
        next if vec $been_here, $line, 1;
        my $start = vec $offs, $line,     32;
        my $end   = vec $offs, $line + 1, 32;
        seek $fh, $start, 0;
        read $fh, $_, $end - $start;
        print;
        ++$i;
        vec( $been_here, $line, 1 ) = 1;
    }

Untested, but you get the idea.
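For what it's worth, assuming you save the above as shuffle_lines.pl (the file name is my invention, not anything in the code), usage would simply be:

    perl shuffle_lines.pl bigfile > bigfile.shuffled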

Note: as far as I can tell the pick-and-retry loop is fair (each unvisited line is equally likely on every draw), but it does waste an increasing number of draws as the pool of remaining lines shrinks.

This uses $offs as a packed array of longs and $been_here as a packed array of bools. Memory consumption with this approach is still proportional to the number of lines in your file, but not nearly as prohibitively so as with native Perl data structures (an array of numbers takes about 20 bytes per element). If you're really short of memory, you could use Sys::Mmap to work with these structures straight from disk very efficiently; mapping the input file would also deftly accelerate the initial gathering of offsets.
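If you want to go that route, here is a minimal, untested sketch of the offset-gathering pass over a mapped file; it assumes Sys::Mmap's documented mmap( VARIABLE, LENGTH, PROTECTION, FLAGS, FILEHANDLE ) interface, where a LENGTH of 0 maps the whole file:

    use strict;
    use warnings;
    use Sys::Mmap;

    open my $fh, '<', $ARGV[ 0 ]
        or die "Couldn't open $ARGV[ 0 ] for reading: $!\n";

    # map the entire file read-only; no read() loop, no $basepos bookkeeping
    my $map;
    mmap( $map, 0, PROT_READ, MAP_SHARED, $fh )
        or die "mmap failed: $!\n";

    my $offs = '';
    vec( $offs, 0, 32 ) = 0;
    my $entries = 1;
    while ( $map =~ /\n/g ) {
        # pos() is just past the newline, i.e. at the next line's start
        vec( $offs, $entries++, 32 ) = pos $map;
    }

    munmap( $map );

The rest of the program would then proceed exactly as above.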

Makeshifts last the longest.

Re^2: Randomizing Big Files
by Anonymous Monk on Jan 26, 2005 at 15:59 UTC
    Something along these lines can work; I will just point out that a SCALAR in Perl is not a char*, since it can hold UTF8/Unicode values. We should use vec() to ensure that the data comes back exactly as it was appended with pack(). Besides, vec() already does the pack trick for us:
    vec( $offs, $i, 32 ) = $n;
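    For example, since 32-bit vec() elements are stored big-endian, the string round-trips with the 'N' pack format (the values here are arbitrary, just for illustration):

        my $offs = '';
        vec( $offs, 0, 32 ) = 1_000_000;
        vec( $offs, 1, 32 ) = 2_000_000;
        print join( ' ', unpack 'N*', $offs ), "\n";    # prints "1000000 2000000"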

      Yes, good points. Using vec would actually simplify the code, even.

      Makeshifts last the longest.