in reply to sorting very large text files

If the file is not so big that the keys won't fit into memory, you can do this:
#!/usr/bin/perl
use strict;
use warnings;

my @data;
my $prev = 0;

# First pass: record each line's sort key, byte offset, and length.
open(my $IN, '<', 'data') or die "open data: $!";
while (<$IN>) {
    chomp;
    my $next = tell($IN);
    my ($key) = /^([^\s]*)/;
    push @data, [ $key, $prev, $next - $prev ];
    $prev = $next;
}
close($IN);

# Sort the index on the key alone.
my @sorted = sort { $a->[0] cmp $b->[0] } @data;

# Second pass: seek to each line in key order and copy it out.
open($IN, '<', 'data') or die "open data: $!";
open(my $OUT, '>', 'sorteddata') or die "open sorteddata: $!";
for (@sorted) {
    seek($IN, $_->[1], 0);
    sysread($IN, my $line, $_->[2]);
    print $OUT $line;
}
close($IN);
close($OUT);
On the cooked data I tested, I got the following timings:
Gnu Sort:
# time sort --temporary-directory=/opt data > sort1

real    0m24.698s
user    0m22.539s
sys     0m1.950s

Perl Sort:
# time perl sort.pl 

real    0m55.900s
user    0m39.897s
sys     0m6.430s
The data file I used had a wc of:
#wc data
  4915200  34406400 383385600 data
I am surprised that this Perl script is only half the speed of Gnu sort on this data. I think that on a bigger data set, with longer lines, it might even be able to sort faster than Gnu sort.

UPDATE: Most of the time seems to be spent in the output loop; all of the seeking really slows things down.
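If the whole file, and not just the keys, happens to fit in memory, one way to sidestep the seeks would be to slurp the file once and copy each line out with substr instead of seek/sysread. This is only a rough, untested sketch following the script above (same 'data'/'sorteddata' filenames; slurping the entire file is an extra assumption beyond keys-fit-in-memory):

#!/usr/bin/perl
use strict;
use warnings;

# Untested variant: trades the per-line seeks for memory by slurping the file.
open(my $IN, '<', 'data') or die "open data: $!";
my $buf = do { local $/; <$IN> };    # read the whole file at once
close($IN);

# Build the same (key, offset, length) index by walking the buffer.
my @data;
my $prev = 0;
while ($buf =~ /\G([^\s]*)[^\n]*\n/g) {
    my $next = pos($buf);
    push @data, [ $1, $prev, $next - $prev ];
    $prev = $next;
}

# Output in key order with substr instead of seek/sysread.
open(my $OUT, '>', 'sorteddata') or die "open sorteddata: $!";
print {$OUT} substr($buf, $_->[1], $_->[2])
    for sort { $a->[0] cmp $b->[0] } @data;
close($OUT);

For data that does not fit in RAM, another option might be to take the sorted index in chunks, re-sort each chunk by offset so the reads are roughly sequential, and then emit that chunk's lines in key order.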

-- gam3
A picture is worth a thousand words, but takes 200K.