in reply to Re: Sorting a large file
in thread Sorting a large file

This was hacked up while I ate lunch today. Give it a maximum number of records per file and very large input source and awaaaaay you go.

Use your own sortsub of course. And your own input records (I used random numbers). Otherwise, it's fit to use.

#!/usr/bin/perl -w # Mergesort use IO::Handle; # For the ->getline require 5.6.0; # Sort sub prototypes $recs=13; # Total number of records to sort..... # Leave out of the real thing $max=5; # Maximum number of records per merge file @files=(); # The prototype is needed because we want lexical # values in the sort because we're using it as a # regular comparison and as a sort sub. sub sortsub ($$) { my($c,$d)=@_; return $c<=>$d; } { # Should be POSIX::tmpnam. But I'm lazy at the moment. # (Under UNIX you can even re-use the same name each # time and just unlink it after the push()!) $tempname="fooaa"; sub store { my($a)=@_; my $f; open($f, "+>/tmp/$tempname") || die; print $f sort sortsub @$a; # Sort small pile seek $f, 0, 0 or warn "Can't seek: $!"; push(@files, { fh => $f, queued => scalar <$f>, }); $tempname++; } } # This is where you'd read the input file to exhaustion # I'm just making up data. The important part is the block itself. while($_=rand() . "\n", $recs--) { push(@sortarr, $_); if (@sortarr==$max) { store(\@sortarr); @sortarr=(); } } store(\@sortarr) if @sortarr; # Store the leftovers LOOP: { ($lowest)=(sort { sortsub($a->{queued}, $b->{queued}); } grep(defined $_->{queued}, @files) )[0]; last unless defined $lowest->{queued}; # Do your processing here print $lowest->{queued}; $lowest->{queued}=$lowest->{fh}->getline(); redo; }

Replies are listed 'Best First'.
Re (tilly) 3: Sorting a large file
by tilly (Archbishop) on Feb 22, 2001 at 00:02 UTC
    As I recommended in Re (tilly) 1: Sorting a large file, I would tryFile::Sort before the above code.

    Secondly for temporary files, use File::Temp instead of hand-rolling.

    And thirdly the fact that you defined your store function inside of your sortsub function suggests some confusion on your part. You cannot get nested functions that way.

    As a matter of personal taste I would drop the prototype, use strict, etc, etc, etc. But the three I gave above are the biggies.

      Always read the code before you go off like this.

      As I recommended in Re (tilly) 1: Sorting a large file, I would tryFile::Sort before the above code.
      No problem, except this was a demonstration of merge sorting, nothing more, and identified as a quick hack. Merge sorting is not rocket science and a valid solution to the problem at hand.
      Secondly for temporary files, use File::Temp instead of hand-rolling.
      For eons, tempfiles have been created with tmpnam() from POSIX. Nothing wrong with that. In fact, I suggested exactly that. Yes, File::Temp might be your solution, but not everyone's. TIMTOWDI

      And thirdly the fact that you defined your store function inside of your sortsub function suggests some confusion on your part. You cannot get nested functions that way.
      No, no confusion. Did you read the code? You'll note that's NOT a nested subroutine declaration. The sort subroutine is all on one line. The bare block is to provide a lexical scope. Read it again.

      As a matter of personal taste I would drop the prototype, use strict, etc, etc, etc. But the three I gave above are the biggies.
      For something written in 20 minutes, that works, demonstrates the topic at hand, has (or suggests) idiomatic perl coding... I didn't think it was too bad.

      PS: The prototype is NECESSARY. Comment was there for your benefit. Read it and you'll discover why.

        In response to your points.

        First of all merge sorting is not rocket science. However it does take some care to get right. For instance you have a rather naive scalability. You did not clean up your temporaries. Your code is not portable.

        Secondly there are good reasons that I suggest using File::Temp rather than getting a name and then using that from the POSIX module. The first reason is that there is no way to guarantee no gap between when a temporary name is chosen and when you open that temporary file using the POSIX interface. By contrast with File::Temp if there is any way to do it on your OS, it will. The second reason is that File::Temp provides automatic and secure cleanup (to whatever extent is possible) of temporary files. Both are very good reasons to use File::Temp rather than POSIX.

        As for the third point, we were both confused by your code. I had not looked closely enough for the position of the braces. You had not put in the keyword my. (Which strict would have told you.) Therefore your bare block served no purpose but to obfuscate.

        As for the matter of personal taste. It takes rather less than 20 minutes to go look at Sort, and find out that there is a wheel that looks like it has been maintained and has had a lot of input for this exact problem. You can, then install and write to that to solve the problem in a cleaner, more efficient, and portable fashion. That is what I call idiomatic coding!

        As for the prototype, I see that you used it, but I have to wonder at the gratuitous lack of backwards compatibility. Note that 5.6.0 is not a very good release - why write code that forces people onto it?

        Anyways if you want to reinvent the wheel, well two can play at that game. Here is the wheel reinvented more scalably, more flexibly, using fewer filehandles, using strict, doing its own cleanup...

        use strict; use File::Temp qw(tempfile); use vars qw($eof $max_recs); # Having a localized $eof is necessary for STDIN to work as expected $max_recs = 500; sub get_block { my $in = shift; return if $eof; my $cmp = shift; my @lines; my $i; $eof = 1; while (<$in>) { push @lines, $_; if($max_recs < ++$i) { $eof = 0; last; } } my $temp = tempfile(); print $temp sort {$cmp->($a, $b)} @lines; seek($temp, 0, 0); return $temp; } sub merge_blocks { my $fh1 = shift; my $fh2 = shift; my $cmp = shift; my $other = <$fh2>; unless (defined $other) { # The second handle has nothing seek ($fh1, 0, 0); return $fh1; } my $out = tempfile(); while (<$fh1>) { if ($cmp->($_, $other) <= 0) { print $out $_; } else { print $out $other; $other = $_; ($fh1, $fh2) = ($fh2, $fh1); } } print $out $other; print $out $_ while <$fh2>; seek ($out, 0, 0); return $out; } sub sorted_block { my $in = shift; my $cmp = shift; my $depth = shift; if (1 < $depth) { my $fh1 = sorted_block($in, $cmp, $depth - 1); return unless defined($fh1); my $fh2 = sorted_block($in, $cmp, $depth - 1); if( defined($fh2) ) { return merge_blocks($fh1, $fh2, $cmp); } else { return $fh1; } } else { return get_block($in, $cmp); } } # If a reference it assumes it is a handle, else it returns an opened # filehandle sub get_handle { my $name = shift; if (ref($name)) { return $name; } else { my $fh = do {local *FOO}; open ($fh, $name) or die "Cannot read from $name: $!"; return $fh; } } # Takes 3 arguments, all optional. First a filehandle to read from, # defaults to STDIN. Secondly a filehandle to write to, defaults to # STDOUT. And thirdly a subroutine that takes 2 arguments and # compares them. Defaults to a lexical comparison. sub merge_sort { my $fh = get_handle(shift || \*STDIN); my $out = get_handle(shift || \*STDOUT); my $cmp = shift || sub {shift cmp shift}; my $sorted = tempfile(); my $depth = 1; local $eof = 0; while (my $block = sorted_block($fh, $cmp, $depth++)) { $sorted = merge_blocks($sorted, $block, $cmp); } print $out $_ while <$sorted>; } # Example test, sort a file from @ARGV or else STDIN. merge_sort(shift);
        PS: What do I mean by more scalable? Well if you compile in large file support then this can, with a fixed amount of memory, sort terabytes of logfiles with only a few dozen passes.

        PPS: With a large number of files your repeated sorts are going to be slow. I don't know whether your 2 pass solution gains more on IO than it loses on the inefficiency of the comparison. I would suspect that it doesn't, but I don't feel like benching that.

        I saw that you had your store function inside a bare block, which provides a lexical scope. The part I can't figure out is why you have that bare block there, since you don't actually have anything lexical in that scope...