in reply to Randomizing Big Files

This should be about as efficient as you will get. It works in two passes: the first finds the start-of-line offsets, then a long loop of seek-and-read jumps directly to those offsets, visiting the lines in random order.

Update: code changed per Anonymonk's hint. Old code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw( ceil );

    my $basepos = 0;

    open my $fh, '<', $ARGV[ 0 ]
        or die "Couldn't open $ARGV[ 0 ] for reading: $!\n";

    # first pass: collect the start offset of every line as packed longs
    my $offs = pack 'L', 0;
    while ( my $advance = read $fh, $_, 128 * 1024 ) {
        while ( /\n/g ) {
            # pos() is just past the newline, i.e. at the next line's start
            $offs .= pack 'L', $basepos + pos();
        }
        $basepos += $advance;
    }

    # we will be looking for "start of following line", so append a bogus
    # entry to obviate need for a special case at the last line (if the
    # file ends in a newline, the entry is already there)
    $offs .= pack 'L', $basepos
        if unpack( 'L', substr $offs, -4 ) != $basepos;

    my $total_lines = length( $offs ) / 4 - 1;
    my $been_here = "\0" x ceil( $total_lines / 8 );

    # second pass: seek to the lines in random order and copy them out
    my $i = 0;
    while ( $i < $total_lines ) {
        my $line = int rand $total_lines;
        next if vec $been_here, $line, 1;
        my ( $start, $end ) = unpack 'x' . ( 4 * $line ) . ' LL', $offs;
        seek $fh, $start, 0;
        read $fh, $_, $end - $start;
        print;
        ++$i;
        vec( $been_here, $line, 1 ) = 1;
    }

New code:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use POSIX qw( ceil );

    my $basepos = 0;

    open my $fh, '<', $ARGV[ 0 ]
        or die "Couldn't open $ARGV[ 0 ] for reading: $!\n";

    # first pass: collect the start offset of every line in $offs,
    # used as a packed array of 32-bit unsigned integers via vec()
    my $offs = '';
    vec( $offs, 0, 32 ) = 0;
    my $entries = 1;
    while ( my $advance = read $fh, $_, 128 * 1024 ) {
        while ( /\n/g ) {
            # pos() is just past the newline, i.e. at the next line's start
            vec( $offs, $entries++, 32 ) = $basepos + pos();
        }
        $basepos += $advance;
    }

    # we will be looking for "start of following line", so append a bogus
    # entry to obviate need for a special case at the last line (if the
    # file ends in a newline, the entry is already there)
    if ( vec( $offs, $entries - 1, 32 ) != $basepos ) {
        vec( $offs, $entries++, 32 ) = $basepos;
    }

    my $total_lines = $entries - 1;
    my $been_here = "\0" x ceil( $total_lines / 8 );

    # second pass: seek to the lines in random order and copy them out
    my $i = 0;
    while ( $i < $total_lines ) {
        my $line = int rand $total_lines;
        next if vec $been_here, $line, 1;
        my $start = vec $offs, $line,     32;
        my $end   = vec $offs, $line + 1, 32;
        seek $fh, $start, 0;
        read $fh, $_, $end - $start;
        print;
        ++$i;
        vec( $been_here, $line, 1 ) = 1;
    }

Untested, but you get the idea.
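For what it's worth, assuming you save the above as shuffle_lines.pl (the file name is my invention, not anything in the code), usage would simply be:

    perl shuffle_lines.pl bigfile > bigfile.shuffled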

Note: as far as I can tell the pick-and-retry loop is fair (each unvisited line is equally likely on every draw), but it does waste an increasing number of draws as the pool of remaining lines shrinks.

This uses $offs as a packed array of longs and $been_here as a packed array of bools. Memory consumption with this approach is still proportional to the number of lines in your file, but not nearly as prohibitively so as with native Perl data structures (an array of numbers takes about 20 bytes per element). If you're really short of memory, you could use Sys::Mmap to work with these structures straight from disk very efficiently; mapping the input file would also deftly accelerate the initial gathering of offsets.
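If you want to go that route, here is a minimal, untested sketch of the offset-gathering pass over a mapped file; it assumes Sys::Mmap's documented mmap( VARIABLE, LENGTH, PROTECTION, FLAGS, FILEHANDLE ) interface, where a LENGTH of 0 maps the whole file:

    use strict;
    use warnings;
    use Sys::Mmap;

    open my $fh, '<', $ARGV[ 0 ]
        or die "Couldn't open $ARGV[ 0 ] for reading: $!\n";

    # map the entire file read-only; no read() loop, no $basepos bookkeeping
    my $map;
    mmap( $map, 0, PROT_READ, MAP_SHARED, $fh )
        or die "mmap failed: $!\n";

    my $offs = '';
    vec( $offs, 0, 32 ) = 0;
    my $entries = 1;
    while ( $map =~ /\n/g ) {
        # pos() is just past the newline, i.e. at the next line's start
        vec( $offs, $entries++, 32 ) = pos $map;
    }

    munmap( $map );

The rest of the program would then proceed exactly as above.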

Makeshifts last the longest.

Re^2: Randomizing Big Files
by Anonymous Monk on Jan 26, 2005 at 15:59 UTC
    Something along these lines can work; I will just point out that a SCALAR in Perl is not a char*, since it can hold UTF8/Unicode values. We should use vec() to ensure that the data comes back exactly as it was appended with pack(). Besides, vec() already does the pack trick for us:
    vec( $offs, $i, 32 ) = $n;
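    For example, since 32-bit vec() elements are stored big-endian, the string round-trips with the 'N' pack format (the values here are arbitrary, just for illustration):

        my $offs = '';
        vec( $offs, 0, 32 ) = 1_000_000;
        vec( $offs, 1, 32 ) = 2_000_000;
        print join( ' ', unpack 'N*', $offs ), "\n";    # prints "1000000 2000000"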

      Yes, good points. Using vec would actually simplify the code, even.

      Makeshifts last the longest.