Here is some code I wrote when someone asked this question on IRC some time ago (randomizing a big file, that is, ignoring the distribute-over-processes aspect). It's a self-tuning recursive distribute/shuffle/collect. It tries hard to be friendly to memory and the filesystem cache and to do only streaming disk accesses. The drawback for your case might be that it takes a long time before the first results start coming out.
#! /usr/bin/perl -w
# Shuffle the lines in a file potentially much bigger than memory
# Author: Ton Hospel
# License: GNU or artistic
use strict;

use File::Path;
use List::Util qw(shuffle);

# Limits on the helper chunks
my $max_size  = 10e6;
my $max_files = 128;

my $dir = "Shuffle.$$";

sub big_shuffle {
    my ($d, $in, $out) = @_;
    my $files;
    if (-f $in) {
        # Regular file: pick the number of chunks from its size
        $files = int(($max_size - 1 + -s _) / $max_size);
        $files = $max_files if $files > $max_files;
        if ($files == 1 || $files == 2 && $max_size * 1.5 > -s _) {
            # Small enough: shuffle in memory and emit
            print($out shuffle(<$in>)) || die "Unexpected write error: $!\n";
            return;
        }
    } else {
        # Size unknown (e.g. a pipe): use the maximum fan-out
        $files = $max_files;
    }
    my $format = sprintf("%s%s%%0%dd", $d, $d eq $dir ? "/" : ".", length($files));
    my (@fhs, @names);
    for (0..$files-1) {
        $names[$_] = sprintf($format, $_+1);
        open($fhs[$_], ">", $names[$_]) ||
            die "Could not create $names[$_]: $!";
    }
    local $_;
    # Distribute: append each line to a randomly chosen chunk file
    print({$fhs[rand $files]} $_) || die "Unexpected write error: $!\n"
        while <$in>;
    close($_) || die "Unexpected close error: $!\n" for @fhs;
    close($in) || die "Unexpected input close error: $!\n";
    # Collect: shuffle each chunk recursively, then remove it
    for (@names) {
        open(my $fh, "<", $_) || die "Could not open $_: $!";
        big_shuffle($_, $fh, $out);
        unlink($_) || die "Could not unlink $_: $!";
    }
}

die "Too many arguments. Usage: $0 [in_file [out_file]]\n" if @ARGV > 2;
my ($in, $out) = @ARGV;
if (defined($in) && $in ne "") {
    open(my $fh, "<", $in) || die "Could not open $in: $!";
    $in = $fh;
} else {
    $in = \*STDIN;
}
if (defined($out) && $out ne "") {
    open(my $fh, ">", $out) || die "Could not create $out: $!";
    $out = $fh;
} else {
    $out = \*STDOUT;
}

mkpath($dir);
# Always clean up the work directory, even when the shuffle dies
eval {
    big_shuffle($dir, $in, $out);
    close($out) || die "Unexpected output close error: $!\n"
};
my $rc = $@;
rmtree($dir);
die $rc if $rc;
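
To make the distribute/shuffle/collect idea easier to see without the file handling, here is a minimal in-memory sketch of the same recursion. It is only an illustration and not part of the script above: the name bucket_shuffle and its parameters $max_items and $max_buckets are made up for this example, with arrays standing in for the temporary chunk files and an item count standing in for the $max_size byte limit.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical in-memory analogue of big_shuffle (names are illustrative).
# Assumes $max_items >= 1; fewer than 2 buckets is bumped up to 2 so that
# the recursion can always split an oversized list.
sub bucket_shuffle {
    my ($items, $max_items, $max_buckets) = @_;

    # Base case: small enough to shuffle directly in memory.
    return shuffle(@$items) if @$items <= $max_items;

    # Choose a fan-out from the size, capped at $max_buckets.
    my $buckets = int((@$items + $max_items - 1) / $max_items);
    $buckets = $max_buckets if $buckets > $max_buckets;
    $buckets = 2 if $buckets < 2;

    # Distribute: every item goes to a uniformly random bucket.
    my @bucket = map { [] } 1 .. $buckets;
    push @{ $bucket[rand $buckets] }, $_ for @$items;

    # Collect: shuffle each bucket recursively and concatenate the results.
    return map { bucket_shuffle($_, $max_items, $max_buckets) } @bucket;
}

# Usage: shuffle 20 lines while never shuffling more than 4 at once.
my @lines = map { "line $_\n" } 1 .. 20;
print bucket_shuffle(\@lines, 4, 3);
```

Because every item picks its bucket independently and each bucket is then shuffled uniformly, concatenating the buckets in a fixed order still gives every permutation the same probability. The real script does the same thing with files, so only one small chunk has to be held in memory at a time.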

In reply to Re: Randomizing Big Files by thospel
in thread Randomizing Big Files by Anonymous Monk
