Essentially, you'd code something that (a) splits your file into chunks small enough for Perl to load into memory and sort relatively efficiently, and then (b) merges the sorted chunks line by line. Since I'm a glutton for punishment, here's a fully-working script I just spent the last half hour writing.

perl process.pl in.txt out.txt ./temp 200

use strict;
use warnings;
use File::Copy;

die "Arguments = in, out, temp dir, sort max in MB.\n" if $#ARGV != 3;
my ($in, $out, $temp, $max) = @ARGV;
die "$in does not exist.\n" if !-e $in;
die "Can't open $in for read.\n" if !open(FH, $in);
die "$temp does not exist, or is not a directory.\n" if !-d $temp;
$temp =~ s|/$||;
$max *= 1024 * 1024;

my (@b1, $size, $n, @t, $t1, $t2);
$size = 0;
$n = 0;

while (<FH>) {
    push @b1, $_;
    $size += length $_;
    if ($size >= $max) {
        ### Over limit, write chunk
        writeTemp();
        @b1 = ();
        $size = 0;
    }
}
close FH;

### Write whatever's left in buffer
writeTemp() if @b1;

### Using this so I don't have to write it twice in the code
sub writeTemp {
    $n++;
    die "Unable to open $temp/$n.txt for write.\n" if !open(FHO, ">$temp/$n.txt");
    @b1 = sort @b1;
    print FHO join('', @b1);
    close FHO;
    print "$in => $temp/$n.txt ($size)\n";
}

### Merge chunks two at a time until only one file remains
@t = (1 .. $n);
while ($#t > 0) {
    $t1 = shift @t;
    $t2 = shift @t;
    $n++;
    mergeFiles("$temp/$t1.txt", "$temp/$t2.txt", "$temp/$n.txt");
    print "$temp/$t1.txt + $temp/$t2.txt => $temp/$n.txt\n";
    unlink "$temp/$t1.txt";
    unlink "$temp/$t2.txt";
    push @t, $n;
}

move("$temp/$n.txt", $out) or die "Unable to move $temp/$n.txt to $out.\n";
print "$temp/$n.txt => $out\n";

sub mergeFiles {
    my ($f1, $f2, $fo) = @_;
    die "Unable to open $f1 for read.\n"  if !open(FH1, $f1);
    die "Unable to open $f2 for read.\n"  if !open(FH2, $f2);
    die "Unable to open $fo for write.\n" if !open(FHO, ">$fo");
    my $l1 = <FH1>;
    my $l2 = <FH2>;
    ### Test with defined() rather than truth, so a final line
    ### consisting of "0" with no newline isn't silently dropped
    while (defined $l1 && defined $l2) {
        if ($l1 lt $l2) {
            print FHO $l1;
            $l1 = <FH1>;
        }
        else {
            print FHO $l2;
            $l2 = <FH2>;
        }
    }
    ### One file is exhausted; slurp and copy the rest of the other
    local $/ = undef;
    if (defined $l1) {
        print FHO $l1;
        $l1 = <FH1>;
        print FHO $l1 if defined $l1;
    }
    else {
        print FHO $l2;
        $l2 = <FH2>;
        print FHO $l2 if defined $l2;
    }
    close FH1;
    close FH2;
    close FHO;
}
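A quick way to sanity-check the result (assuming you've saved the script above as process.pl; the file names and the 1 MB chunk limit below are arbitrary): generate some random test data, run the script with a small limit so several chunks actually get merged, and compare against sort(1) in the C locale, which sorts bytewise like Perl's default sort and "lt".

```shell
# Hypothetical smoke test -- adjust names/paths to taste.
mkdir -p temp
perl -e 'print int(rand 1e9), "\n" for 1..100000' > in.txt
perl process.pl in.txt out.txt ./temp 1

# LC_ALL=C makes sort(1) compare raw bytes, matching Perl's sort.
LC_ALL=C sort in.txt | cmp - out.txt && echo "output matches"
```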

Could probably be improved at a few points on read speed, but that shouldn't be the primary time sink.

If we assume 1 GB of spare RAM and a 20 GB input file, you might want to limit it to as little as 200 MB per chunk if the records are relatively short (one number per line, for instance). That leaves you with 100 chunks, and each chunk might take up to a few hundred seconds to sort initially and to merge later on, depending on how slow your computer is. A lot of time, but you can just leave it running for an hour or two and come back to it later.
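To get a feel for the merge cost, here's a rough, self-contained cost model of the pairwise FIFO merge the script uses. The 100 chunks of 200 MB are the hypothetical figures from above, not measurements: 100 chunks take 99 merges total, and each merge rewrites both of its inputs once.

```perl
use strict;
use warnings;

# Each queue entry is a chunk's size in MB; the script's @t queue
# works the same way, just with file numbers instead of sizes.
my @queue = (200) x 100;
my ($merges, $written_mb) = (0, 0);

while (@queue > 1) {
    my $a = shift @queue;
    my $b = shift @queue;
    $merges++;
    $written_mb += $a + $b;    # a merge rewrites both inputs once
    push @queue, $a + $b;      # merged file goes to the back of the queue
}

# Final queue entry is the whole 20000 MB file; $merges is 99,
# since every merge reduces the file count by exactly one.
printf "%d merges, %.1f GB written during merging\n",
    $merges, $written_mb / 1024;
```

The total GB written grows roughly with log2 of the chunk count, since each byte passes through about that many merges, so halving the chunk size costs less extra merge I/O than you might expect.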


In reply to Re: Sort command equivalent in perl by TJPride
in thread Sort command equivalent in perl by sandeepau
