Re: Fast/Efficient Sort for Large Files (use the sort(1) program)

by grinder (Bishop)
on Dec 19, 2002 at 08:38 UTC


in reply to Fast/Efficient Sort for Large Files

So, you want fast and efficient. Given the sample you have provided, the first two fields are fixed length. This simplifies the problem tremendously:

if ( system( '/usr/bin/sort', '-o', $filename, $filename ) ) {
    my $exit = $? >> 8;    # sort's exit status
    my $sig  = $? & 127;   # signal it died from, if any
    my $core = $? & 128;   # true if it dumped core
    die "failed to sort $filename: exit code $exit, signal $sig, core? $core\n";
}

Were this not the case, I would have pored over the manpage until I could figure out which switches were needed to specify a trickier combination of field specifications, but that would not change the point of the argument that follows.
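For instance (not from the original post, just an illustration of the kind of key switches the manpage describes), a numeric sort on the second whitespace-delimited field, with the first field as a tiebreaker, might look something like this -- the field numbers here are made up, so adjust -t and -k to match your data:

if ( system( '/usr/bin/sort', '-t', ' ', '-k', '2,2n', '-k', '1,1',
             '-o', $filename, $filename ) ) {
    die "failed to sort $filename: $?\n";
}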

If it's performance you are after, you have simply no choice but to use the sort(1) utility. It has years of optimisations and bug-fixes. There is nothing you can do in Perl which will approach its speed or reliability, so any custom solution you develop should use it as the basic building block.

What you could do, if sort does not take multiple CPUs into account [1] (my Solaris box is a monoprocessor, so I can't tell), would be to take the file and split out every n million lines into separate files, so that each file is held on a different physical device (and not merely different slices of the same disk), and you have x files in total where x is the number of CPUs you want to sort with. Then spawn enough copies of sort to sort them all in parallel. This will spread the IO load across separate devices. And pray that the OS is smart enough to bind each instance to a separate CPU... When they're all done, mergesort them back together (with sort -m).
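A rough sketch of that split-and-merge idea in Perl (the chunk file names, chunk count and output name are invented for illustration; the original post gives no code for this):

# Assume the file has already been split into one chunk per CPU,
# ideally each chunk on a different physical device.
my @chunks = ( '/disk1/chunk.aa', '/disk2/chunk.ab', '/disk3/chunk.ac' );

my @pids;
for my $chunk (@chunks) {
    my $pid = fork();
    die "fork failed: $!\n" unless defined $pid;
    if ( $pid == 0 ) {
        # child: sort this chunk into its own output file
        exec '/usr/bin/sort', '-o', "$chunk.sorted", $chunk;
        die "exec sort failed: $!\n";
    }
    push @pids, $pid;
}
waitpid( $_, 0 ) for @pids;

# merge the pre-sorted chunks back into a single file
system( '/usr/bin/sort', '-m', '-o', 'sorted.out',
        map { "$_.sorted" } @chunks ) == 0
    or die "failed to merge sorted chunks: $?\n";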

<update>Having read grantm's reply elsewhere in this thread, it should be noted that sort has a command-line switch (-y) that can be employed to control its memory consumption.</update>

Another solution would be to split the file into pieces in the same directory, using the split(1) command. I would predict that this will be much faster than using Perl to break the file up, although IO contention goes up (during the sorting phase). You'll have to try all three approaches to see which one is the fastest. I'd go with the simplest solution first myself. The other solutions I have described sound like a lot of effort for what will probably be fairly meagre returns. If you do get something interesting out of it though, I'd be very interested to hear the results.
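If you go the split(1) route, the call from Perl is about as simple as it gets (a sketch; the line count and the 'chunk.' prefix are arbitrary):

# break the big file into pieces of one million lines each,
# named chunk.aa, chunk.ab, ... in the current directory
system( '/usr/bin/split', '-l', '1000000', $filename, 'chunk.' ) == 0
    or die "failed to split $filename: $?\n";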

<update>[1] To clarify my statement "if sort does not take multiple CPUs into account," what I meant is that if the file to be sorted exceeds a certain size, sort(1) will perform a disk-based mergesort anyway, which is what the subsequent suggestions spell out in more detail. If you know you are going to go to disk, you may be able to beat its default behaviour, but I wouldn't like to lay money on it.</update>


print@_{sort keys %_},$/if%_=split//,'= & *a?b:e\f/h^h!j+n,o@o;r$s-t%t#u'
