Re (tilly) 3: Sorting a large file

Replies are listed 'Best First'.
Re: Re (tilly) 3: Sorting a large file by clintp (Curate) on Feb 22, 2001 at 02:58 UTC
Always read the code before you go off like this. As I recommended in Re (tilly) 1: Sorting a large file, I would tryFile::Sort before the above code. No problem, except this was a demonstration of merge sorting, nothing more, and identified as a quick hack. Merge sorting is not rocket science and a valid solution to the problem at hand. Secondly for temporary files, use File::Temp instead of hand-rolling. For eons, tempfiles have been created with tmpnam() from POSIX. Nothing wrong with that. In fact, I suggested exactly that. Yes, File::Temp might be your solution, but not everyone's. TIMTOWDI And thirdly the fact that you defined your store function inside of your sortsub function suggests some confusion on your part. You cannot get nested functions that way. No, no confusion. Did you read the code? You'll note that's NOT a nested subroutine declaration. The sort subroutine is all on one line. The bare block is to provide a lexical scope. Read it again. As a matter of personal taste I would drop the prototype, use strict, etc, etc, etc. But the three I gave above are the biggies. For something written in 20 minutes, that works, demonstrates the topic at hand, has (or suggests) idiomatic perl coding... I didn't think it was too bad. PS: The prototype is NECESSARY. Comment was there for your benefit. Read it and you'll discover why.	[reply]
Re (tilly) 5: Sorting a large file by tilly (Archbishop) on Feb 22, 2001 at 05:18 UTC
In response to your points. First of all merge sorting is not rocket science. However it does take some care to get right. For instance you have a rather naive scalability. You did not clean up your temporaries. Your code is not portable. Secondly there are good reasons that I suggest using File::Temp rather than getting a name and then using that from the POSIX module. The first reason is that there is no way to guarantee no gap between when a temporary name is chosen and when you open that temporary file using the POSIX interface. By contrast with File::Temp if there is any way to do it on your OS, it will. The second reason is that File::Temp provides automatic and secure cleanup (to whatever extent is possible) of temporary files. Both are very good reasons to use File::Temp rather than POSIX. As for the third point, we were both confused by your code. I had not looked closely enough for the position of the braces. You had not put in the keyword my. (Which strict would have told you.) Therefore your bare block served no purpose but to obfuscate. As for the matter of personal taste. It takes rather less than 20 minutes to go look at Sort, and find out that there is a wheel that looks like it has been maintained and has had a lot of input for this exact problem. You can, then install and write to that to solve the problem in a cleaner, more efficient, and portable fashion. That is what I call idiomatic coding! As for the prototype, I see that you used it, but I have to wonder at the gratuitous lack of backwards compatibility. Note that 5.6.0 is not a very good release - why write code that forces people onto it? Anyways if you want to reinvent the wheel, well two can play at that game. Here is the wheel reinvented more scalably, more flexibly, using fewer filehandles, using strict, doing its own cleanup... use strict; use File::Temp qw(tempfile); use vars qw($eof $max_recs); # Having a localized $eof is necessary for STDIN to work as expected $max_recs = 500; sub get_block { my $in = shift; return if $eof; my $cmp = shift; my @lines; my $i; $eof = 1; while (<$in>) { push @lines, $_; if($max_recs < ++$i) { $eof = 0; last; } } my $temp = tempfile(); print $temp sort {$cmp->($a, $b)} @lines; seek($temp, 0, 0); return $temp; } sub merge_blocks { my $fh1 = shift; my $fh2 = shift; my $cmp = shift; my $other = <$fh2>; unless (defined $other) { # The second handle has nothing seek ($fh1, 0, 0); return $fh1; } my $out = tempfile(); while (<$fh1>) { if ($cmp->($_, $other) <= 0) { print $out $_; } else { print $out $other; $other = $_; ($fh1, $fh2) = ($fh2, $fh1); } } print $out $other; print $out $_ while <$fh2>; seek ($out, 0, 0); return $out; } sub sorted_block { my $in = shift; my $cmp = shift; my $depth = shift; if (1 < $depth) { my $fh1 = sorted_block($in, $cmp, $depth - 1); return unless defined($fh1); my $fh2 = sorted_block($in, $cmp, $depth - 1); if( defined($fh2) ) { return merge_blocks($fh1, $fh2, $cmp); } else { return $fh1; } } else { return get_block($in, $cmp); } } # If a reference it assumes it is a handle, else it returns an opened # filehandle sub get_handle { my $name = shift; if (ref($name)) { return $name; } else { my $fh = do {local FOO}; open ($fh, $name) or die "Cannot read from $name: $!"; return $fh; } } # Takes 3 arguments, all optional. First a filehandle to read from, # defaults to STDIN. Secondly a filehandle to write to, defaults to # STDOUT. And thirdly a subroutine that takes 2 arguments and # compares them. Defaults to a lexical comparison. sub merge_sort { my $fh = get_handle(shift \|\| \STDIN); my $out = get_handle(shift \|\| \*STDOUT); my $cmp = shift \|\| sub {shift cmp shift}; my $sorted = tempfile(); my $depth = 1; local $eof = 0; while (my $block = sorted_block($fh, $cmp, $depth++)) { $sorted = merge_blocks($sorted, $block, $cmp); } print $out $_ while <$sorted>; } # Example test, sort a file from @ARGV or else STDIN. merge_sort(shift); [download] PS: What do I mean by more scalable? Well if you compile in large file support then this can, with a fixed amount of memory, sort terabytes of logfiles with only a few dozen passes. PPS: With a large number of files your repeated sorts are going to be slow. I don't know whether your 2 pass solution gains more on IO than it loses on the inefficiency of the comparison. I would suspect that it doesn't, but I don't feel like benching that.	[reply] [d/l]
Re: Re: Re (tilly) 3: Sorting a large file by chipmunk (Parson) on Feb 22, 2001 at 10:25 UTC
I saw that you had your store function inside a bare block, which provides a lexical scope. The part I can't figure out is why you have that bare block there, since you don't actually have anything lexical in that scope...	[reply]