This was hacked up while I ate lunch today. Give it a maximum number of records per file and a very large input source and awaaaaay you go.

Use your own sortsub, of course, and your own input records (I used random numbers). Otherwise, it's fit to use.
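
As a rough illustration of swapping in your own sortsub, here is a minimal sketch. The record layout (a name, a tab, a number) and the helper key_of are made up for the example and are not part of the script; the only thing carried over is the ($$) prototype, which lets the same sub be called by sort and as an ordinary comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record layout: "name<TAB>number\n".
# key_of is an illustrative helper, not part of the original script.
sub key_of {
    my ($rec) = @_;
    chomp $rec;                       # works on a copy, caller's record is untouched
    return (split /\t/, $rec)[1];     # second field is the numeric key
}

# With the ($$) prototype, sort passes the two elements in @_.
sub sortsub ($$) {
    my ($c, $d) = @_;
    # Numeric key first, whole record as a tie-breaker.
    return key_of($c) <=> key_of($d) || $c cmp $d;
}

my @records = ("pear\t30\n", "apple\t4\n", "fig\t30\n");

# Used as a sort sub:
my @sorted = sort sortsub @records;
print @sorted;

# Used as a plain comparison (negative means the first argument sorts first):
print "apple sorts before pear\n" if sortsub($records[1], $records[0]) < 0;
```

Any comparison will do here, as long as it returns a negative, zero, or positive value the way <=> and cmp do.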

#!/usr/bin/perl -w

# Mergesort

use IO::Handle;  # For the ->getline
require 5.6.0;   # Sort sub prototypes

$recs=13;   # Total number of records to sort.....
            # Leave out of the real thing
$max=5;     # Maximum number of records per merge file
@files=();

# The ($$) prototype makes sort pass the two elements in @_
#    instead of in $a and $b, so the same sub works both as
#    a sort sub and as a regular comparison.
sub sortsub ($$) { my($c,$d)=@_; return $c<=>$d;  }
{
        # Should be POSIX::tmpnam.  But I'm lazy at the moment.
        #   (Under UNIX you can even re-use the same name each
        #   time and just unlink it after the push()!)
        $tempname="fooaa";
        sub store {
                my($a)=@_;
                my $f;
                open($f, "+>/tmp/$tempname") || die "Can't open /tmp/$tempname: $!";
                print $f sort sortsub @$a;  # Sort small pile
                seek $f, 0, 0 or warn "Can't seek: $!";
                push(@files, {
                        fh => $f,
                        queued => scalar <$f>,
                });
                $tempname++;
        }
}

# This is where you'd read the input file to exhaustion
# I'm just making up data.  The important part is the block itself.
while($_=rand() . "\n", $recs--) {
        push(@sortarr, $_);
        if (@sortarr==$max) {
                store(\@sortarr);
                @sortarr=();
        }
}
store(\@sortarr) if @sortarr;  # Store the leftovers

LOOP: {
        ($lowest)=(sort {
            sortsub($a->{queued}, $b->{queued});
            } grep(defined $_->{queued}, @files) )[0];
        last unless defined $lowest->{queued};
    
    # Do your processing here
        print $lowest->{queued};
    
        $lowest->{queued}=$lowest->{fh}->getline();
        redo;
}
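
The merge loop above re-sorts every file's queued record each time it emits one. If that turns out to matter with many merge files, one possible variation is a single pass over @files that remembers the lowest queued record. This is only a sketch under assumptions: the in-memory filehandles (Perl 5.8 or later) and the sample data stand in for the temp files, and it is a different selection step from the one in the script, not a description of it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub sortsub ($$) { my ($c, $d) = @_; return $c <=> $d; }

# Stand-ins for the sorted temp files: in-memory filehandles (Perl 5.8+).
# The sample data is invented for the example.
my @runs = ("1\n4\n9\n", "2\n3\n", "5\n");
my @files;
for my $i (0 .. $#runs) {
    open(my $fh, '<', \$runs[$i]) or die "Can't open in-memory run: $!";
    push @files, { fh => $fh, queued => scalar <$fh> };
}

my @merged;
while (1) {
    # One pass over the file heads instead of a full sort.
    my $lowest;
    for my $f (@files) {
        next unless defined $f->{queued};          # this file is exhausted
        $lowest = $f
            if !defined $lowest
            || sortsub($f->{queued}, $lowest->{queued}) < 0;
    }
    last unless defined $lowest;                   # every file is exhausted

    # Do your processing here
    push @merged, $lowest->{queued};

    # Refill from the file we just consumed; undef at end of file.
    $lowest->{queued} = readline($lowest->{fh});
}
print @merged;
```

For a handful of merge files the difference is unlikely to be noticeable; the original sort-and-take-first version is shorter and easier to read.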

In reply to Re: Re: Sorting a large file by clintp
in thread Sorting a large file by c-era
