in reply to Sorting a large file

Very old problem with an easy solution: you sort the file in chunks and then merge the chunks back together. This is called a merge sort. The chunks can be as large (or small) as you like.

Re: Re: Sorting a large file
by clintp (Curate) on Feb 21, 2001 at 23:16 UTC
    This was hacked up while I ate lunch today. Give it a maximum number of records per file and a very large input source and awaaaaay you go.

    Use your own sortsub of course. And your own input records (I used random numbers). Otherwise, it's fit to use.

    #!/usr/bin/perl -w
    # Mergesort
    use IO::Handle;   # For the ->getline
    require 5.6.0;    # Sort sub prototypes

    $recs=13;   # Total number of records to sort.....
                # Leave out of the real thing
    $max=5;     # Maximum number of records per merge file
    @files=();

    # The prototype is needed because we want lexical
    # values in the sort because we're using it as a
    # regular comparison and as a sort sub.
    sub sortsub ($$) { my($c,$d)=@_; return $c<=>$d; }

    {
        # Should be POSIX::tmpnam.  But I'm lazy at the moment.
        # (Under UNIX you can even re-use the same name each
        # time and just unlink it after the push()!)
        $tempname="fooaa";
        sub store {
            my($a)=@_;
            my $f;
            open($f, "+>/tmp/$tempname") || die;
            print $f sort sortsub @$a;   # Sort small pile
            seek $f, 0, 0 or warn "Can't seek: $!";
            push(@files, {
                fh => $f,
                queued => scalar <$f>,
            });
            $tempname++;
        }
    }

    # This is where you'd read the input file to exhaustion
    # I'm just making up data.  The important part is the block itself.
    while($_=rand() . "\n", $recs--) {
        push(@sortarr, $_);
        if (@sortarr==$max) {
            store(\@sortarr);
            @sortarr=();
        }
    }
    store(\@sortarr) if @sortarr;   # Store the leftovers

    LOOP: {
        ($lowest)=(sort { sortsub($a->{queued}, $b->{queued}); }
                   grep(defined $_->{queued}, @files) )[0];
        last unless defined $lowest->{queued};

        # Do your processing here
        print $lowest->{queued};

        $lowest->{queued}=$lowest->{fh}->getline();
        redo;
    }

      As I recommended in Re (tilly) 1: Sorting a large file, I would try File::Sort before the above code.

      Secondly for temporary files, use File::Temp instead of hand-rolling.

      And thirdly the fact that you defined your store function inside of your sortsub function suggests some confusion on your part. You cannot get nested functions that way.

      As a matter of personal taste I would drop the prototype, use strict, etc, etc, etc. But the three I gave above are the biggies.
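
For the File::Temp suggestion, here is a minimal sketch of what the store step might look like. It is an assumption about how one could wire it in, not the code from the post above: `store_chunk` is a hypothetical name, and the numeric comparison stands in for whatever sort sub you actually use.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write one sorted chunk to a temp file and rewind it.
sub store_chunk {
    my ($records) = @_;

    # tempfile() picks a unique name and returns a handle opened for
    # reading and writing; UNLINK => 1 removes the file at program exit.
    my ($fh, $name) = tempfile(UNLINK => 1);

    print {$fh} sort { $a <=> $b } @$records;    # sort the small pile
    seek($fh, 0, 0) or die "Can't seek $name: $!";
    return $fh;
}

my $fh   = store_chunk([ "3\n", "1\n", "2\n" ]);
my @back = <$fh>;    # read the chunk back in sorted order
print @back;
```

The returned handle can be pushed onto the list of merge files exactly as the hand-rolled version does; only the naming and cleanup change.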

        Always read the code before you go off like this.

        As I recommended in Re (tilly) 1: Sorting a large file, I would try File::Sort before the above code.
        No problem, except this was a demonstration of merge sorting, nothing more, and identified as a quick hack. Merge sorting is not rocket science, and it is a valid solution to the problem at hand.
        Secondly for temporary files, use File::Temp instead of hand-rolling.
        For eons, tempfiles have been created with tmpnam() from POSIX. Nothing wrong with that. In fact, I suggested exactly that. Yes, File::Temp might be your solution, but not everyone's. TIMTOWTDI

        And thirdly the fact that you defined your store function inside of your sortsub function suggests some confusion on your part. You cannot get nested functions that way.
        No, no confusion. Did you read the code? You'll note that's NOT a nested subroutine declaration. The sort subroutine is all on one line. The bare block is to provide a lexical scope. Read it again.
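
A small sketch of the bare-block idiom being described, for anyone following along. The name `next_name` is made up for the example. Note one difference from the posted code: the original assigns `$tempname` without `my`, so there it is really a package global; the sketch declares it with `my` so the block genuinely hides it.

```perl
use strict;
use warnings;

{
    # Visible only inside this bare block; the sub below closes over it.
    my $tempname = "fooaa";

    # Hypothetical helper, not part of the original post.
    sub next_name {
        return $tempname++;    # magical string increment: fooaa, fooab, ...
    }
}

# Outside the block, $tempname is out of scope, but next_name() still works.
my @names = (next_name(), next_name(), next_name());
print "@names\n";    # fooaa fooab fooac
```

The sub is declared inside the block, not inside another sub, so nothing is nested; the block only limits who can see the variable.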

        As a matter of personal taste I would drop the prototype, use strict, etc, etc, etc. But the three I gave above are the biggies.
        For something written in 20 minutes, that works, demonstrates the topic at hand, has (or suggests) idiomatic perl coding... I didn't think it was too bad.

        PS: The prototype is NECESSARY. Comment was there for your benefit. Read it and you'll discover why.
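
To illustrate the point about the prototype, here is a small sketch (the name `by_num` is made up for the example). Per the `sort` documentation, a comparison sub declared with a `($$)` prototype receives its two elements in `@_` rather than in the package variables `$a` and `$b`, so the same sub can be handed to `sort` by name and also called directly like any other function.

```perl
use strict;
use warnings;

# With the ($$) prototype, sort passes the two elements in @_.
sub by_num ($$) { my ($x, $y) = @_; return $x <=> $y; }

my @in     = (10, 9, 100, 1);
my @sorted = sort by_num @in;    # used as a sort sub
my $cmp    = by_num(2, 10);      # used as an ordinary comparison

print "@sorted\n";    # 1 9 10 100
print "$cmp\n";       # -1
```

Without the prototype, the direct call would still work, but `sort by_num @in` would leave `@_` empty and the comparison would see undefined values.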