Re: Fast/Efficient Sort for Large Files

FWIW. The code below, a split/sort/merge sort (possibly a Merge sort, but I don't have Knuth handy), processes a 1 million record simulation of your data on a 233MHz/256MB machine with a badly fragged disc in a reasonable time.

With 8 CPUs, this algorithm begs for forking at the point where the split files are sorted. Not hard to do, but an entirely worthless exercise for me with my hardware.
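For what it's worth, here is a rough sketch of what that forking might look like. It is untested against the real data and assumes a Unix-like fork. The helper names (sort_file, sort_files_in_parallel) and the cap on live children are my own inventions for illustration, not part of the script below. It uses the same 37-byte records and the same sort keys.

```perl
use strict;
use warnings;
use POSIX ();

my $reclen = 37;    # Same fixed record length as the main script.

# Hypothetical helper: sort one split file in place.
sub sort_file {
    my ($file) = @_;
    local $/ = \$reclen;
    open my $in, '<', $file or die "$file: $!";
    binmode $in;
    my @recs = <$in>;
    close $in;

    @recs = sort { substr($b, 0, 5) <=> substr($a, 0, 5)
                || substr($b, 6, 11) <=> substr($a, 6, 11) } @recs;

    open my $out, '>', $file or die "$file: $!";
    binmode $out;
    print $out @recs;
    close $out or die "$file: $!";
}

# Hypothetical helper: sort the files in child processes, keeping at most
# $max children alive at once. Returns the number of files that failed.
sub sort_files_in_parallel {
    my ($max, @files) = @_;
    my %kids;
    my $failed = 0;

    my $reap = sub {
        my $pid    = wait();
        my $status = $?;
        if ($pid == -1) { %kids = (); return }    # No children left.
        my $file = delete $kids{$pid};
        return unless defined $file;              # Not one of ours.
        if ($status != 0) { warn "sort of $file failed\n"; $failed++ }
    };

    for my $file (@files) {
        $reap->() while keys(%kids) >= $max;
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {    # Child: sort one file, then leave at once.
            my $ok = eval { sort_file($file); 1 };
            warn $@ unless $ok;
            POSIX::_exit($ok ? 0 : 1);
        }
        $kids{$pid} = $file;    # Parent: remember who is sorting what.
    }
    $reap->() while %kids;
    return $failed;
}
```

Something like sort_files_in_parallel(8, @fhs) would stand in for the sorting loop. Each child holds a whole split file in memory, so the cap matters as much for RAM as for CPUs.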

On the kind of hardware you have it should fly.

It takes two parameters, infile and outfile, and one option, -N=n: the number of characters at the front of each line used as the split key, which controls how many temp files are created. I used 2, for approx. 100 files of around 400 KB each (I originally wrote 4 MB; should've used that commify routine on the numbers :), but with your hardware, the default of 1, for 10 files, should do.
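To make the effect of -N concrete, here is a small sketch of the bucketing arithmetic. The helper names are hypothetical and it assumes every record starts with digits and that the keys are spread roughly evenly, which may not hold for your data.

```perl
use strict;
use warnings;

# Hypothetical helper: the temp file a record would be written to,
# given the first $n characters of the record as the split key.
sub bucket_for {
    my ($record, $n) = @_;
    return 'temp.' . substr($record, 0, $n);
}

# With all-digit leading characters there are at most 10**$n buckets.
sub max_buckets {
    my ($n) = @_;
    return 10 ** $n;
}

# Rough size of each bucket in bytes, assuming evenly spread keys.
sub approx_bucket_bytes {
    my ($records, $reclen, $n) = @_;
    return int($records * $reclen / max_buckets($n));
}
```

For 1 million 37-byte records, approx_bucket_bytes(1_000_000, 37, 2) comes to 370,000 bytes, which is where the "around 400 KB" figure comes from.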

You'll need to check the sort parms as I made a goof when gen'ing my test data. I think they are ok as is, but check. Maybe I'll get around to making the keys specifiable from the command line.
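If you want the keys specifiable before I get around to it, one possible approach is to build the comparator from a list of offset,length pairs. This is only a sketch: make_comparator and the "0,5:6,11" spec format are made up for illustration, and it assumes numeric keys sorted in descending order, as in the script below.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a spec such as "0,5:6,11" (offset,length
# pairs separated by colons) into a comparator for fixed-width records.
sub make_comparator {
    my ($spec) = @_;
    my @fields = map { [ split /,/ ] } split /:/, $spec;
    return sub {
        my ($x, $y) = @_;
        for my $f (@fields) {
            # Numeric, descending: compare $y's field against $x's.
            my $cmp = substr($y, $f->[0], $f->[1])
                  <=> substr($x, $f->[0], $f->[1]);
            return $cmp if $cmp;
        }
        return 0;
    };
}
```

It would be used as my $by_keys = make_comparator('0,5:6,11'); followed by @recs = sort { $by_keys->($a, $b) } @recs;. The extra sub call per comparison costs something, so for big files the hard-coded version will be faster.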

There is a lot that could be done to make this stronger and more efficient, but it's somewhere for you to start should you feel so inclined to write your own.

#! perl -sw
use vars qw/$N/;
use strict;
no strict 'refs';
$|++;

my $reclen = 37; #! Adjust to suit your records/line ends.
$N = $N || 1;

warn "Usage: $0 [-N=n] infile [outfile]\n" and exit(-1) unless @ARGV;

warn "Reading input file $ARGV[0] ", -s $ARGV[0],  "\n";

if ( not defined $ARGV[1] ) {
    warn "Output file not specified; sorted output will go to STDOUT. Continue [N|y]?\n";
    exit -1 if <STDIN> !~ /^Y/i;
}

$/= \$reclen;

open INPUT, '<', $ARGV[0] or die "Couldn't open $ARGV[0]: $!";
binmode(INPUT);

my (@fhs);
while ( <INPUT> ) {
    my $key = substr($_, 0, $::N);
    if (not defined $fhs[$key]) {
        $fhs[$key] = "temp.$key";
        warn( "\rCreating file: $fhs[$key] ");
        open( $fhs[$key], ">$fhs[$key]")
            or die( "Couldn't create $fhs[$key]: $!");
        binmode($fhs[$key]);
    }
    print {$fhs[$key]} $_;
}
#! Get rid of unused filehandles, close the rest (flushing their buffers),
#! then drop any that reference zero length files.
@fhs = grep { defined } @fhs;

close $_ for @fhs;
@fhs = grep { ! -z $_ } @fhs;
close INPUT;

warn "Split made to: ", scalar @fhs, " files\n";

#! Sort the split files on the first & second field
for my $fh (@fhs) {
    warn "$fh: reading;...";
    open $fh, "<$fh" or die $!;
    binmode($fh);
    my @recs = <$fh>;
    close $fh;

    warn " sorting: ", scalar @recs, " recs;...";
    @recs = sort{ substr($b, 0, 5) <=> substr($a, 0, 5)
            ||    substr($b, 6, 11) <=> substr($a, 6, 11) } @recs;

    warn " writing;...";
    open $fh, ">$fh" or die $!;
    binmode($fh);
    print $fh @recs;
    close $fh;

    warn "done;\n";
}

warn "Merging files: ";
*SORTED = *STDOUT;
open SORTED, '>', $ARGV[1] and binmode(SORTED) or die $! if $ARGV[1];
for my $fh (reverse @fhs) {
    warn " $fh;";
    open $fh, "<$fh" and binmode($fh) or die $!;
    print SORTED <$fh>;
    close $fh;
}
warn "\nClosing sorted file: sorted\n";
close SORTED;
warn "Deleting temp files\n";
unlink $_ or warn "Couldn't unlink $_\n" for @fhs;
warn "Done.\n";

exit (0);

Examine what is said, not who speaks.
