Good marrow, bright stars... don't ask me what that is supposed to mean; I am in a weird mood today. I have a little efficiency question for all of you.
I have a big data file (45 MB) with one item per line, and the items are just words. All I want to do is run a normal string sort on them. Of course, my computer cannot load a 45 MB file into memory and still sort it in a reasonable time. So I split everything into separate files by first letter, sort each of those files, and then combine them back together.
Here is the code:
use strict;
print "Shifting all words into letter data files - ";
open(DATA,"all.txt") || die "cannot open all.txt for input: $!";
while(<DATA>) {
my $word = $_;
chomp($word);
$word=~s/^\s+//;
my $letter = lc(substr($word,0,1));
if($letter!~/[a-z]/) {
$letter = "___";
}
if(-e "$letter.dat") {
open(FILE,">>$letter.dat") || die "cannot open $letter.dat for append: $!";
} else{
open(FILE,">$letter.dat") || die "cannot open $letter.dat for output: $!";
}
print FILE "$word\n";
close(FILE);
}
close(DATA);
print "done\n";
print "Organizing letter file alphabetically - \n";
open(DATA,">all1.txt") || die "cannot open all1.txt for output: $!";
foreach my $filename (sort {lc($a) cmp lc($b)} <*.dat>) {
print "\topening $filename - ";
my @words = ();
open(FILE,$filename) || die "cannot open $filename for input: $!";
while(<FILE>) {
my $word = $_;
chomp($word);
push(@words,$word);
}
close(FILE);
print "\t\tsorting - ";
@words = sort {lc($a) cmp lc($b)} @words;
print "done\n";
print "\t\tremoving duplicates - ";
my $prev = "not equal to $words[0]";
@words = grep($_ ne $prev && ($prev = $_, 1), @words);
print "done\n";
foreach my $word (@words) {
print DATA "$word\n";
}
print "\tdone\n";
}
close(DATA);
print "done\n";
Anyone have any ideas to make it go faster? Because right now it takes a good half an hour or more... and the file will eventually get larger. Thanks... :)
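One direction that might help, offered only as a minimal sketch: the first loop opens and closes a letter file for every single word, which is a likely cost on a file this size. The sub below keeps one filehandle per letter open for the whole pass instead. The name split_by_letter and its $dir argument are hypothetical and not part of the script above. Note one difference in behaviour: it opens each letter file with ">" and so truncates any .dat file left over from an earlier run, rather than appending to it.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not from the original script): read words from the
# filehandle $in and write each one to "<letter>.dat" inside $dir, opening
# every letter file only once and reusing the handle for later words.
sub split_by_letter {
    my ($in, $dir) = @_;
    my %fh_for;    # bucket name => open filehandle

    while (my $word = <$in>) {
        chomp $word;
        $word =~ s/^\s+//;

        # Same bucketing rule as the script: first letter, lower-cased,
        # with everything that is not a-z going to the "___" bucket.
        my $letter = lc substr($word, 0, 1);
        $letter = "___" if $letter !~ /^[a-z]$/;

        # Open the bucket file the first time this letter is seen.
        unless ($fh_for{$letter}) {
            my $path = File::Spec->catfile($dir, "$letter.dat");
            open($fh_for{$letter}, '>', $path)
                or die "cannot open $path for output: $!";
        }
        print { $fh_for{$letter} } "$word\n";
    }

    # Close every bucket once, after all words have been written.
    for my $letter (keys %fh_for) {
        close($fh_for{$letter}) or die "cannot close $letter.dat: $!";
    }

    my @buckets = sort keys %fh_for;
    return @buckets;
}
```

Calling split_by_letter($in, ".") with $in opened for reading on all.txt would stand in for the first loop; the sorting and merging loop could stay as it is. Whether this accounts for most of the half hour is an assumption that would need timing on the real file.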