in reply to Randomizing Big Files
This should be about as efficient as you will get. It works in two passes: the first finds all start-of-line offsets; the second picks lines in random order, seeking and reading each one directly at its recorded offset.
Update: code changed per Anonymonk's hint. Old code:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw( ceil );
use List::Util qw( shuffle );

my $basepos = 0;

open my $fh, '<', $ARGV[ 0 ]
    or die "Couldn't open $ARGV[ 0 ] for reading: $!\n";

my $offs = pack 'L', 0;

while( my $advance = read $fh, $_, 128 * 1024 ) {
    s/(?=\n)/$offs .= pack 'L', $basepos + pos(); ""/eg;
    $basepos += $advance;
}

# we will be looking for "start of following line", so this is
# a bogus entry to obviate need for special case at last line
$offs .= pack 'L', $basepos;

my $total_lines = length( $offs ) / 4;
my $been_here = "\0" x ceil( $total_lines / 8 );

my $i = 0;
while( $i < $total_lines ) {
    my $line = int rand $total_lines;
    next if vec $been_here, $line, 1;

    my ( $start, $end ) = unpack "x " . ( 4 * $line ) . " L L", $offs;

    seek $fh, $start, 0;
    read $fh, $_, $end - $start;
    print;

    ++$i;
    vec( $been_here, $line, 1 ) = 1;
}
New code:
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw( ceil );
use List::Util qw( shuffle );

my $basepos = 0;

open my $fh, '<', $ARGV[ 0 ]
    or die "Couldn't open $ARGV[ 0 ] for reading: $!\n";

# pass 1: record start-of-line offsets in $offs,
# a packed array of 32-bit ints built with vec()
my $offs = '';
vec( $offs, 0, 32 ) = 0;
my $total_lines = 1;

while( my $advance = read $fh, $_, 128 * 1024 ) {
    s/(?=\n)/vec( $offs, $total_lines++, 32 ) = $basepos + pos(); ""/eg;
    $basepos += $advance;
}

# we will be looking for "start of following line", so this is
# a bogus entry to obviate need for special case at last line
vec( $offs, $total_lines, 32 ) = $basepos;

# pass 2: pick lines at random, skipping ones already printed;
# $been_here holds one bit per line
my $been_here = "\0" x ceil( $total_lines / 8 );

my $i = 0;
while( $i < $total_lines ) {
    my $line = int rand $total_lines;
    next if vec $been_here, $line, 1;

    my $start = vec $offs, $line, 32;
    my $end   = vec $offs, $line + 1, 32;

    seek $fh, $start, 0;
    read $fh, $_, $end - $start;
    print;

    ++$i;
    vec( $been_here, $line, 1 ) = 1;
}
Untested, but you get the idea.
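(For the record, you'd run it as perl whatever-you-save-it-as.pl bigfile, with the shuffled lines arriving on STDOUT, ready to redirect to a file.)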
Note: the pick-and-reject loop is actually fair — every draw is uniform over all lines and already-printed ones are simply rejected, so each remaining line stays equally likely — but it wastes draws badly near the end, when nearly every pick lands on a line already seen (coupon-collector territory: expect on the order of n·ln n draws for n lines).
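Since shuffle is already loaded, the output loop could just as well walk a pre-shuffled index list and touch each line exactly once. A sketch (untested like the rest, using the same variables as the new code above):

my @order = shuffle 0 .. $total_lines - 1;
for my $line ( @order ) {
    my $start = vec $offs, $line, 32;
    my $end   = vec $offs, $line + 1, 32;
    seek $fh, $start, 0;
    read $fh, $_, $end - $start;
    print;
}

The trade-off: @order is a native Perl array at roughly 20 bytes per element, which is presumably exactly the overhead the bit-vector version is trying to avoid.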
This uses $offs as a packed array of longs and $been_here as a packed array of bools. Memory consumption with this approach is still proportional to the number of lines in your file, but nowhere near as prohibitive as with native Perl data structures (an array of numbers costs about 20 bytes per element). If you're really short of memory, you could keep the structures on disk and access them via Sys::Mmap quite efficiently; that would also nicely speed up the initial gathering of offsets.
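In case vec is unfamiliar: it treats a plain string as a flat array of fixed-width unsigned integers, any power-of-two width from 1 bit up. A minimal illustration (the numbers are picked arbitrarily):

my $longs = '';
vec( $longs, 0, 32 ) = 42;           # element 0, 32 bits wide
vec( $longs, 1, 32 ) = 100_000;      # element 1
print length( $longs ), "\n";        # 8 -- exactly 4 bytes per element

my $flags = '';
vec( $flags, 1000, 1 ) = 1;          # set bit 1000
print vec( $flags, 1000, 1 ), "\n";  # 1
print length( $flags ), "\n";        # 126 -- ceil( 1001 / 8 ) bytes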
Makeshifts last the longest.