In response to your points.
First of all, merge sorting is not rocket science. However, it does take some care to get right. For instance, your version scales rather naively, does not clean up its temporary files, and is not portable.
Secondly, there are good reasons that I suggest using File::Temp rather than getting a name from the POSIX module and then opening that name yourself. The first reason is that with the POSIX interface there is no way to guarantee there is no gap between when a temporary name is chosen and when you open that file; by contrast, if there is any way to close that gap on your OS, File::Temp will. The second reason is that File::Temp provides automatic and secure cleanup of temporary files, to whatever extent your OS allows. Both are very good reasons to use File::Temp rather than POSIX.
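To make that race concrete, here is a minimal sketch (illustrative only, using the era's POSIX::tmpnam):

use File::Temp qw(tempfile);
use POSIX qw(tmpnam);

# Safe: the name is chosen and the file opened in one atomic step
# (O_CREAT|O_EXCL where the OS supports it), and File::Temp removes
# the file for you at exit.
my ($safe_fh, $safe_name) = tempfile();

# Racy: between tmpnam() returning and open() running, another process
# can create (or symlink) that name out from under you.
my $racy_name = tmpnam();
open(RACY, "> $racy_name") or die "Cannot open $racy_name: $!";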
As for the third point, we were both confused by your code. I had not looked closely enough at the position of the braces. You had not put in the keyword my (which strict would have told you about). Therefore your bare block served no purpose but to obfuscate.
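For the record, here is the difference in miniature (a sketch, not your actual code):

use strict;

{
    my $count = 0;   # lexical: scoped to this block, so the braces earn their keep
    print "$count\n";
}

# Without the my, strict refuses to compile the assignment:
#     $count = 0;    # Global symbol "$count" requires explicit package name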
As for the matter of personal taste: it takes rather less than 20 minutes to go look at Sort on CPAN and find out that there is a wheel for this exact problem, one that looks like it has been maintained and has had a lot of input. You can then install it and write to that interface, solving the problem in a cleaner, more efficient, and more portable fashion. That is what I call idiomatic coding!
As for the prototype, I see that you used it, but I have to wonder at the gratuitous lack of backwards compatibility. Note that 5.6.0 is not a very good release - why write code that forces people onto it?
Anyway, if you want to reinvent the wheel, two can play at that game. Here is the wheel reinvented: more scalable, more flexible, using fewer filehandles, using strict, doing its own cleanup...
use strict;
use File::Temp qw(tempfile);
use vars qw($eof $max_recs);
# Having a localized $eof is necessary for STDIN to work as expected
$max_recs = 500;
sub get_block {
    my $in = shift;
    return if $eof;
    my $cmp = shift;
    my @lines;
    my $i = 0;
    $eof = 1;
    while (<$in>) {
        push @lines, $_;
        # Spill once we pass $max_recs lines; more input remains.
        if ($max_recs < ++$i) {
            $eof = 0;
            last;
        }
    }
    # Sort this block in memory and write it to a fresh temporary file.
    my $temp = tempfile();
    print $temp sort {$cmp->($a, $b)} @lines;
    seek($temp, 0, 0);
    return $temp;
}
sub merge_blocks {
    my $fh1 = shift;
    my $fh2 = shift;
    my $cmp = shift;
    my $other = <$fh2>;
    unless (defined $other) {
        # The second handle has nothing, so the first is already merged.
        seek($fh1, 0, 0);
        return $fh1;
    }
    my $out = tempfile();
    while (<$fh1>) {
        if ($cmp->($_, $other) <= 0) {
            print $out $_;
        }
        else {
            # The held-back line wins; emit it, hold the current line,
            # and swap handles so we keep reading from the winning side.
            print $out $other;
            $other = $_;
            ($fh1, $fh2) = ($fh2, $fh1);
        }
    }
    # One side is exhausted; flush the held line and drain the other side.
    print $out $other;
    print $out $_ while <$fh2>;
    seek($out, 0, 0);
    return $out;
}
sub sorted_block {
    my $in = shift;
    my $cmp = shift;
    my $depth = shift;
    if (1 < $depth) {
        # Build two runs of half this depth and merge them.
        my $fh1 = sorted_block($in, $cmp, $depth - 1);
        return unless defined($fh1);
        my $fh2 = sorted_block($in, $cmp, $depth - 1);
        if (defined($fh2)) {
            return merge_blocks($fh1, $fh2, $cmp);
        }
        else {
            return $fh1;
        }
    }
    else {
        return get_block($in, $cmp);
    }
}
# If given a reference, assume it is already a filehandle; otherwise
# treat it as a file name, open it for reading, and return the handle.
sub get_handle {
    my $name = shift;
    if (ref($name)) {
        return $name;
    }
    else {
        my $fh = do {local *FOO};  # anonymous glob; works before 5.6
        open($fh, $name) or die "Cannot read from $name: $!";
        return $fh;
    }
}
# Takes 3 arguments, all optional: a filehandle to read from (defaults
# to STDIN), a filehandle to write to (defaults to STDOUT), and a
# subroutine that takes 2 arguments and compares them (defaults to a
# lexical comparison).
sub merge_sort {
    my $fh = get_handle(shift || \*STDIN);
    my $out = get_handle(shift || \*STDOUT);
    my $cmp = shift || sub {shift cmp shift};
    my $sorted = tempfile();  # starts empty, accumulates the result
    my $depth = 1;
    local $eof = 0;
    # Each pass merges a run twice as long as the last into $sorted.
    while (my $block = sorted_block($fh, $cmp, $depth++)) {
        $sorted = merge_blocks($sorted, $block, $cmp);
    }
    print $out $_ while <$sorted>;
}
# Example test, sort a file from @ARGV or else STDIN.
merge_sort(shift);
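For instance, a numeric sort could look like this (hypothetical file name; note that the comparator sees whole lines, trailing newlines included, which <=> happily ignores):

# Sort numbers.txt numerically to STDOUT; merge_sort opens the named
# file itself via get_handle.
merge_sort('numbers.txt', \*STDOUT, sub { $_[0] <=> $_[1] });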
PS: What do I mean by more scalable? Well, if your Perl is built with large file support, this can sort terabytes of logfiles in a fixed amount of memory with only a few dozen passes over the data.
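Rough arithmetic to back that up (illustrative figures only): a terabyte of 100-byte log lines is about 10^10 lines, or about 2*10^7 blocks of 500 lines. Each level of merging doubles the run length, so every line goes through roughly log2(2*10^7), that is about 25, merge passes.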
PPS: With a large number of files your repeated sorts are going to be slow. I don't know whether your 2-pass solution gains more on I/O than it loses on the inefficiency of the comparison. I would suspect that it doesn't, but I don't feel like benchmarking that.