In response to your points.

First of all, merge sorting is not rocket science, but it does take some care to get right. For instance, your version scales rather naively, it does not clean up its temporary files, and it is not portable.

Secondly, there are good reasons why I suggest using File::Temp rather than getting a name from the POSIX module and then opening it yourself. The first is that with the POSIX interface there is no way to close the gap between the moment a temporary name is chosen and the moment you open that file; File::Temp, by contrast, creates and opens the file atomically if there is any way to do that on your OS. The second is that File::Temp provides automatic and secure cleanup of its temporary files (to whatever extent is possible). Both are very good reasons to use File::Temp rather than POSIX.
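To make that concrete, here is a minimal sketch of the two approaches (the scratch data and error messages are made up for illustration; tmpnam() and tempfile() are the real interfaces):

use POSIX qw(tmpnam);
use File::Temp qw(tempfile);

# Racy: another process can create (or symlink) the chosen name in the
# gap between tmpnam() returning and the open() below.
my $name = tmpnam();
open (RACY, ">$name") or die "Cannot create $name: $!";

# Atomic where the OS allows it: the file is created and opened in one
# step, and File::Temp removes it for you when you are done with it.
my $fh = tempfile();
print $fh "scratch data\n";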

As for the third point, we were both confused by your code. I had not looked closely enough at where the braces were, and you had left out the keyword my (which strict would have told you about). Without my, your bare block created no useful scope and served no purpose but to obfuscate.
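A minimal sketch of why the my matters (the variable name is invented, and strict is deliberately left off so the second block compiles):

{
  my $count = 0;  # lexical: exists only inside these braces
}
# $count is gone out here

{
  $count = 0;     # package global: the braces buy you nothing,
                  # and use strict would have refused to compile this
}
# $count is still visible out here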

As for the matter of personal taste: it takes rather less than 20 minutes to go look at Sort and find that there is a wheel which looks maintained and has had a lot of input on this exact problem. You can then install it and write to it, solving the problem in a cleaner, more efficient, and more portable fashion. That is what I call idiomatic coding!

As for the prototype, I see that you used it, but I have to wonder at the gratuitous lack of backwards compatibility. Note that 5.6.0 is not a very good release, so why write code that forces people onto it?
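I don't know exactly which 5.6-only feature you leaned on, but here are two common offenders and spellings of the same thing that work on older perls (a sketch, with an invented filename, not a quote from your code):

# Requires 5.6.0 or later:
our $max_recs = 500;
open (my $new_style, "< data.txt") or die "Cannot read data.txt: $!";

# The same thing, working on 5.005 and earlier as well:
use vars qw($max_recs);
$max_recs = 500;
my $old_style = do { local *FOO };
open ($old_style, "< data.txt") or die "Cannot read data.txt: $!";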

Anyways, if you want to reinvent the wheel, well, two can play at that game. Here is the wheel reinvented more scalably and more flexibly, using fewer filehandles, using strict, and doing its own cleanup...

use strict;
use File::Temp qw(tempfile);
use vars qw($eof $max_recs);
# Having a localized $eof is necessary for STDIN to work as expected
$max_recs = 500;

# Read up to $max_recs lines from $in, sort them in memory with $cmp,
# and return a rewound temporary filehandle holding the sorted run.
sub get_block {
  my $in = shift;
  return if $eof;
  my $cmp = shift;
  my @lines;
  my $i;
  $eof = 1;
  while (<$in>) {
    push @lines, $_;
    if ($max_recs < ++$i) {
      $eof = 0;
      last;
    }
  }
  my $temp = tempfile();
  print $temp sort {$cmp->($a, $b)} @lines;
  seek($temp, 0, 0);
  return $temp;
}

# Merge two sorted filehandles into a new temporary file and return
# that file, rewound.
sub merge_blocks {
  my $fh1 = shift;
  my $fh2 = shift;
  my $cmp = shift;
  my $other = <$fh2>;
  unless (defined $other) {
    # The second handle has nothing
    seek ($fh1, 0, 0);
    return $fh1;
  }
  my $out = tempfile();
  while (<$fh1>) {
    if ($cmp->($_, $other) <= 0) {
      print $out $_;
    }
    else {
      print $out $other;
      $other = $_;
      ($fh1, $fh2) = ($fh2, $fh1);
    }
  }
  print $out $other;
  print $out $_ while <$fh2>;
  seek ($out, 0, 0);
  return $out;
}

# Recursively build a sorted block of up to $max_recs * 2**($depth - 1)
# lines by merging pairs of smaller blocks.
sub sorted_block {
  my $in = shift;
  my $cmp = shift;
  my $depth = shift;
  if (1 < $depth) {
    my $fh1 = sorted_block($in, $cmp, $depth - 1);
    return unless defined($fh1);
    my $fh2 = sorted_block($in, $cmp, $depth - 1);
    if (defined($fh2)) {
      return merge_blocks($fh1, $fh2, $cmp);
    }
    else {
      return $fh1;
    }
  }
  else {
    return get_block($in, $cmp);
  }
}

# If a reference it assumes it is a handle, else it returns an opened
# filehandle
sub get_handle {
  my $name = shift;
  if (ref($name)) {
    return $name;
  }
  else {
    my $fh = do {local *FOO};
    open ($fh, $name) or die "Cannot read from $name: $!";
    return $fh;
  }
}

# Takes 3 arguments, all optional.  First a filehandle to read from,
# defaults to STDIN.  Secondly a filehandle to write to, defaults to
# STDOUT.  And thirdly a subroutine that takes 2 arguments and
# compares them.  Defaults to a lexical comparison.
sub merge_sort {
  my $fh = get_handle(shift || \*STDIN);
  my $out = get_handle(shift || \*STDOUT);
  my $cmp = shift || sub {shift cmp shift};
  my $sorted = tempfile();
  my $depth = 1;
  local $eof = 0;
  while (my $block = sorted_block($fh, $cmp, $depth++)) {
    $sorted = merge_blocks($sorted, $block, $cmp);
  }
  print $out $_ while <$sorted>;
}

# Example test, sort a file from @ARGV or else STDIN.
merge_sort(shift);
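For instance, a numeric sort of a hypothetical numbers.txt to STDOUT would look like this (pass a real handle, not a filename, as the second argument, since get_handle opens names for reading):

merge_sort('numbers.txt', \*STDOUT, sub { $_[0] <=> $_[1] });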
PS: What do I mean by more scalable? Well, if your perl has large file support compiled in, then this can sort terabytes of logfiles in a fixed amount of memory with only a few dozen passes: each time $depth goes up the block being built doubles in size, so a couple of dozen doublings of a 500-line block already cover billions of lines.

PPS: With a large number of files, your repeated sorts are going to be slow. I don't know whether your 2-pass solution gains more on IO than it loses on the inefficiency of its comparisons. I suspect that it doesn't, but I don't feel like benchmarking it.

