sorting large data

by thealienz1 (Pilgrim)
on Jul 23, 2002 at 16:04 UTC

thealienz1 has asked for the wisdom of the Perl Monks concerning the following question:

Good marrow, bright stars... don't ask me what that's supposed to mean; I'm in a weird mood today. I have a little efficiency question for all of you.

I have a big data file (45 MB) with one item per line, and the items are just words. All I want to do is a normal string sort of them. Of course my computer cannot load a 45 MB file into memory and still sort it in a reasonable time. So what I did is split everything into separate files by first letter, sort each of those files, and then combine the files back together.

Here is the code:

use strict;

print "Shifting all words into letter data files - ";
open(DATA, "all.txt") || die "cannot open all.txt for input: $!";
while (<DATA>) {
    my $word = $_;
    chomp($word);
    $word =~ s/^\s+//;
    my $letter = lc(substr($word, 0, 1));
    if ($letter !~ /[a-z]/) {
        $letter = "___";
    }
    if (-e "$letter.dat") {
        open(FILE, ">>$letter.dat") || die "cannot open $letter.dat for append: $!";
    }
    else {
        open(FILE, ">$letter.dat") || die "cannot open $letter.dat for output: $!";
    }
    print FILE "$word\n";
    close(FILE);
}
close(DATA);
print "done\n";

print "Organizing letter file alphabetically - \n";
open(DATA, ">all1.txt") || die "cannot open all1.txt for output: $!";
foreach my $filename (sort { lc($a) cmp lc($b) } <*.dat>) {
    print "\topening $filename - ";
    my @words = ();
    open(FILE, $filename) || die "cannot open $filename for input: $!";
    while (<FILE>) {
        my $word = $_;
        chomp($word);
        push(@words, $word);
    }
    close(FILE);
    print "\t\tsorting - ";
    @words = sort { lc($a) cmp lc($b) } @words;
    print "done\n";
    print "\t\tremoving duplicates - ";
    my $prev = "not equal to $words[0]";
    @words = grep($_ ne $prev && ($prev = $_, 1), @words);
    print "done\n";
    foreach my $word (@words) {
        print DATA "$word\n";
    }
    print "\tdone\n";
}
print "done\n";

Anyone have any ideas to make it go faster? Right now it takes a good half hour or more, and the file will eventually get larger. Thanks... :)

Replies are listed 'Best First'.
Re: sorting large data
by Abigail-II (Bishop) on Jul 23, 2002 at 16:14 UTC
    If you are on a Unix, POSIX, Windows + cygwin (or some other package with Unix tools), VMS + whatever the name of their Unix tools package is, or some other decent system, you could speed up your program, while bringing it down to one line:
    system "sort in_file > out_file";
    Unix sort is specially equipped to sort large files (large on disk compared to available RAM).
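    For instance (a hedged sketch, not part of the original reply: the -f/-u/-T/-o flags assume a GNU-style sort(1), and the file names are placeholders), the case-insensitive sort plus duplicate removal can be pushed onto the external tool in one call:

    # Sketch only: assumes a GNU-style sort(1) and placeholder file names.
    # -f folds case, -u keeps one line per duplicate group, -T picks the
    # directory for temporary merge files, -o names the output file.
    system("sort", "-f", "-u", "-T", "/tmp", "-o", "all1.txt", "all.txt") == 0
        or die "sort failed: $?";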

    Abigail

      But how would you do this kind of sort if you don't have access to GNU sort? I have pondered this for a while and have not come up with any efficient solutions. Certainly loading a 45 MB text file into RAM is not the answer.

      "Falling in love with map, one block at a time." - simeon2000

        Well, if you don't have access to GNU sort, you can always try one of the many other implementations of Unix sort.... ;-).

        Anyway, you would do what Unix sort does. Split up the data into pieces small enough to swallow (how much that is varies from system to system), sort each piece, and store it in a temporary file. Now you have a bunch of sorted files, and you have to merge them. You might even have to do this recursively. A rough sketch of the idea follows below.
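        A minimal sketch of that split/sort/merge approach (the chunk size, file names, and the simple merge loop are my own assumptions, not anything from the original reply):

        use strict;
        use warnings;
        use File::Temp qw(tempfile);

        # Assumed chunk size and file names; tune for the memory you really have.
        my $chunk_lines = 500_000;
        my @runs;    # filehandles of sorted temporary "runs"

        # Pass 1: read a manageable chunk, sort it in memory, write it to a temp file.
        open my $in, '<', 'all.txt' or die "cannot open all.txt: $!";
        while (1) {
            my @chunk;
            while (@chunk < $chunk_lines and defined(my $line = <$in>)) {
                chomp $line;
                push @chunk, $line;
            }
            last unless @chunk;
            my $fh = tempfile();
            print {$fh} "$_\n" for sort { lc($a) cmp lc($b) } @chunk;
            seek $fh, 0, 0;    # rewind so the merge pass can read it back
            push @runs, $fh;
        }
        close $in;

        # Pass 2: merge - repeatedly emit the smallest current line among the runs.
        open my $out, '>', 'all1.txt' or die "cannot open all1.txt: $!";
        my @current = map { scalar readline($_) } @runs;
        while (grep { defined } @current) {
            my ($best) = sort { lc($current[$a]) cmp lc($current[$b]) }
                         grep { defined $current[$_] } 0 .. $#current;
            print {$out} $current[$best];
            $current[$best] = readline($runs[$best]);    # refill from that run
        }
        close $out;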

        Read Knuth if you want to know everything about merge sort.

        Abigail

        Any good algorithms book should cover sorting. See Knuth, volume 2, or Orwant et al., Mastering Algorithms with Perl.

Re: sorting large data
by RMGir (Prior) on Jul 23, 2002 at 17:16 UTC
    Your main problem that I can see is that you're reopening the "letter" files for every word, then closing them again.

    That's obviously going to slow your processing down a LOT.

    Try this, using File::Temp to simplify your problem:

    use File::Temp qw/ tempfile tempdir /;
    use Fcntl qw/ SEEK_SET /;    # needed for the seek() whence constant below

    # this is going to look a lot like
    # the File::Temp perldoc page
    my $dir = tempdir(CLEANUP => 1);

    # basic approach: we're going to open one temp
    # file per initial letter, then seek back to the
    # start of the temp file when it's time to read
    # from it. We let File::Temp handle deleting
    # the temp files.
    my %filehandles;
    my $defaultKey = "___";
    for ('a' .. 'z', $defaultKey) {
        $filehandles{$_} = tempfile(DIR => $dir);
    }

    print "Shifting all words into letter data files - ";
    open(DATA, "all.txt") || die "cannot open all.txt for input: $!";
    while (<DATA>) {
        my $word = $_;
        chomp($word);
        $word =~ s/^\s+//;
        my $letter = lc(substr($word, 0, 1));
        if ($letter !~ /[a-z]/) {
            $letter = $defaultKey;    # "___"
        }
        my $fh = $filehandles{$letter} || die "No file handle for $letter";
        print $fh "$word\n";
    }
    close(DATA);
    print "done\n";

    print "Organizing letter file alphabetically - \n";
    open(DATA, ">all1.txt") || die "cannot open all1.txt for output: $!";
    foreach my $letter (sort keys %filehandles) {
        print "\tseeking $letter - ";
        my $fh = $filehandles{$letter} || die "No file handle for $letter";
        my @words = ();
        # seek back to the start of this temp file to read from it
        seek $fh, 0, SEEK_SET;
        while (<$fh>) {
            my $word = $_;
            chomp($word);
            push(@words, $word);
        }
        print "\t\tsorting - ";
        @words = sort { lc($a) cmp lc($b) } @words;
        print "done\n";
        print "\t\tremoving duplicates - ";
        my $prev = "not equal to $words[0]";
        @words = grep($_ ne $prev && ($prev = $_, 1), @words);
        print "done\n";
        # write all the words in one call, should be faster
        print DATA join "\n", @words;
        print DATA "\n" if @words;
        print "\tdone\n";
    }
    print "done\n";
    (Edit: Added comments, and I'd forgotten the DIR parameter on tempfile.)
    --
    Mike
Re: sorting large data
by Ionitor (Scribe) on Jul 23, 2002 at 16:35 UTC
    Have you tried sorting it in memory? I've sorted word lists nearly that large in much less than a half hour. It will likely go pretty quickly, unless your free RAM + free swap space is less than 100 MB. If it is, it's probably worth the small investment for more RAM. Remember, your time is worth something too, and we have more than 640K memory these days.
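    A minimal sketch of that all-in-memory route (the file names and the case-insensitive comparison mirror the original post; the rest is an assumption):

    use strict;
    use warnings;

    # Slurp every word, sort case-insensitively, drop adjacent duplicates.
    open my $in, '<', 'all.txt' or die "cannot open all.txt: $!";
    chomp(my @words = <$in>);
    close $in;

    open my $out, '>', 'all1.txt' or die "cannot open all1.txt: $!";
    my $prev;
    for my $word (sort { lc($a) cmp lc($b) } @words) {
        print {$out} "$word\n" unless defined $prev and $word eq $prev;
        $prev = $word;
    }
    close $out;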
Re: sorting large data
by thealienz1 (Pilgrim) on Jul 23, 2002 at 16:12 UTC

    And also, to let you guys know, I put in a little code to remove duplicates... stole it from perlfaq4.
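    For reference, that's the grep-based idiom from perlfaq4's "How can I remove duplicate elements from a list or array?" answer; it assumes the list is already sorted so duplicates sit next to each other (the @sorted name below is just a placeholder):

    # On a sorted list, keep an element only when it differs from the previous
    # one; ($prev = $_, 1) updates $prev and then evaluates to true.
    my $prev = "not equal to $sorted[0]";
    my @unique = grep { $_ ne $prev && (($prev = $_), 1) } @sorted;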
