in reply to sorting very large text files
If the file is small enough that the keys (plus an offset and a length for each line) fit into memory, you can do this:
    #!/usr/bin/perl
    use strict;
    use warnings;

    my (@data, @sorted);
    my $prev = 0;

    # Pass 1: record [key, offset, length] for every line.
    open(my $IN, '<', 'data') or die "Cannot open data: $!";
    while (<$IN>) {
        my $next = tell($IN);
        my ($key) = /^(\S*)/;    # key is the first whitespace-delimited field
        push(@data, [ $key, $prev, $next - $prev ]);
        $prev = $next;
    }
    close($IN);

    # Sort the in-memory index, not the lines themselves.
    @sorted = sort { $a->[0] cmp $b->[0] } @data;

    # Pass 2: visit each line in sorted order and copy it out.
    open($IN, '<', 'data') or die "Cannot open data: $!";
    open(my $OUT, '>', 'sorteddata') or die "Cannot open sorteddata: $!";
    for (@sorted) {
        # sysread() bypasses Perl's buffering, so pair it with sysseek()
        # rather than seek().
        sysseek($IN, $_->[1], 0);
        sysread($IN, my $line, $_->[2]);
        print $OUT $line;
    }
    close($IN);
    close($OUT);

On the cooked data I tested, I got the following timings:

GNU sort:

    # time sort --temporary-directory=/opt data > sort1
    real 0m24.698s
    user 0m22.539s
    sys  0m1.950s

Perl sort:

    # time perl sort.pl
    real 0m55.900s
    user 0m39.897s
    sys  0m6.430s

The data file I used had a wc of:

    # wc data
    4915200 34406400 383385600 data

I am surprised that this Perl script runs at only about half the speed of GNU sort on this data. I think that on a bigger data set, with long lines, it might even be able to sort faster than GNU sort.
UPDATE: Most of the time seems to be spent in the output loop. All of the seeks seem to really slow things down.
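One way to check where the time goes is to time the three phases separately. The following is only a rough sketch, not the script I timed above: the helper names timed, build_index and write_sorted are made up for illustration, and the source and destination file names are taken from the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: run a code ref, return (elapsed seconds, its results).
sub timed {
    my ($code) = @_;
    my $start  = time;
    my @result = $code->();
    return (time - $start, @result);
}

# Build a list of [key, offset, length] entries, one per line of $file.
sub build_index {
    my ($file) = @_;
    open(my $in, '<', $file) or die "Cannot open $file: $!";
    my ($prev, @index) = (0);
    while (<$in>) {
        my $next = tell($in);
        my ($key) = /^(\S*)/;
        push @index, [ $key, $prev, $next - $prev ];
        $prev = $next;
    }
    close($in);
    return @index;
}

# Copy lines from $src to $dst in the order given by @index,
# doing one seek and one read per line.
sub write_sorted {
    my ($src, $dst, @index) = @_;
    open(my $in,  '<', $src) or die "Cannot open $src: $!";
    open(my $out, '>', $dst) or die "Cannot open $dst: $!";
    for my $entry (@index) {
        seek($in, $entry->[1], 0) or die "Cannot seek in $src: $!";
        read($in, my $line, $entry->[2]);
        print {$out} $line;
    }
    close($in);
    close($out) or die "Cannot close $dst: $!";
    return;
}

# Only run the timing when called as: perl phases.pl SOURCE DEST
if (@ARGV == 2) {
    my ($src, $dst) = @ARGV;
    my ($t_index, @index)  = timed(sub { build_index($src) });
    my ($t_sort,  @sorted) = timed(sub { sort { $a->[0] cmp $b->[0] } @index });
    my ($t_write)          = timed(sub { write_sorted($src, $dst, @sorted) });
    printf "index %.2fs, sort %.2fs, write %.2fs\n",
        $t_index, $t_sort, $t_write;
}
```

This version uses the buffered seek/read pair instead of sysseek/sysread, so its numbers will not match the timings above exactly. If the write phase dominates, that points at the random-access reads rather than the sort itself.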
-- gam3
A picture is worth a thousand words, but takes 200K.