Re: Fast/Efficient Sort for Large Files (use the sort(1) program)

by grinder (Bishop)
on Dec 19, 2002 at 08:38 UTC


in reply to Fast/Efficient Sort for Large Files

So, you want fast and efficient. Given the sample you have provided, the first two fields are fixed length. This simplifies the problem tremendously:

if ( system( '/usr/bin/sort', '-o', $filename, $filename ) ) {
    my $exit = $? >> 8;    # sort's exit status
    my $sig  = $? & 127;   # signal it died from, if any
    my $core = $? & 128;   # true if it dumped core
    die "failed to sort $filename: exit code $exit, signal $sig, core? $core\n";
}

Were this not the case, I would have pored over the manpage until I could figure out which switches were needed to specify a trickier combination of field specifications, but that would not change the point of the argument that follows.
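For instance (not from the original post, just an illustration of the kind of key switches the manpage describes), a numeric sort on the second whitespace-delimited field, with the first field as a tiebreaker, might look something like this -- the field numbers here are made up, so adjust -t and -k to match your data:

if ( system( '/usr/bin/sort', '-t', ' ', '-k', '2,2n', '-k', '1,1',
             '-o', $filename, $filename ) ) {
    die "failed to sort $filename: $?\n";
}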

If it's performance you are after, you have simply no choice but to use the sort(1) utility. It has years of optimisations and bug-fixes. There is nothing you can do in Perl which will approach its speed or reliability, so any custom solution you develop should use it as the basic building block.

What you could do, if sort does not take multiple CPUs into account [1] (my Solaris box is a monoprocessor, so I can't tell), would be to take the file and split out every n million lines into separate files, so that each file is held on a different physical device (and not merely different slices of the same disk), and you have x files in total where x is the number of CPUs you want to sort with. Then spawn enough copies of sort to sort them all in parallel. This will spread the IO load across separate devices. And pray that the OS is smart enough to bind each instance to a separate CPU... When they're all done, mergesort them back together (with sort -m).
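A rough sketch of that split-and-merge idea in Perl (the chunk file names, chunk count and output name are invented for illustration; the original post gives no code for this):

# Assume the file has already been split into one chunk per CPU,
# ideally each chunk on a different physical device.
my @chunks = ( '/disk1/chunk.aa', '/disk2/chunk.ab', '/disk3/chunk.ac' );

my @pids;
for my $chunk (@chunks) {
    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;
    if ( $pid == 0 ) {
        # child: sort this chunk into its own output file
        exec '/usr/bin/sort', '-o', "$chunk.sorted", $chunk;
        die "exec sort failed: $!\n";
    }
    push @pids, $pid;
}
waitpid( $_, 0 ) for @pids;

# merge the pre-sorted chunks back into a single file
system( '/usr/bin/sort', '-m', '-o', 'sorted.out',
        map { "$_.sorted" } @chunks ) == 0
    or die "failed to merge sorted chunks: $?\n";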

<update>Having read grantm's reply elsewhere in this thread, it should be noted that sort has a command-line switch (-y) that can be employed to control its memory consumption.</update>

Another solution would be to split the file into pieces in the same directory, using the split(1) command. I would predict that this will be much faster than using Perl to break the file up, although IO contention goes up (during the sorting phase). You'll have to try all three approaches to see which one is the fastest. I'd go with the simplest solution first myself. The other solutions I have described sound like a lot of effort for what will probably be fairly meagre returns. If you do get something interesting out of it though, I'd be very interested to hear the results.
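If you go the split(1) route, the call from Perl is about as simple as it gets (a sketch; the line count and the 'chunk.' prefix are arbitrary):

# break the big file into pieces of one million lines each,
# named chunk.aa, chunk.ab, ... in the current directory
system( '/usr/bin/split', '-l', '1000000', $filename, 'chunk.' ) == 0
    or die "failed to split $filename: $?\n";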

<update>[1] To clarify my statement "if sort does not take multiple CPUs into account," what I meant is that if the file to be sorted exceeds a certain size, sort(1) will perform a disk-based mergesort anyway, which is what the subsequent suggestions spell out in more detail. If you know you are going to go to disk, you may be able to beat its default behaviour, but I wouldn't like to lay money on it.</update>


print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
