I stumbled on the source of the problem, but I do not understand the cause. In the code I pasted earlier, another wasteful thing I was doing was updating the topX list after each word. If instead I just save all the similarities and then use PDL functions to sort (and so find the topX, etc.), my problems go away (code below; and recall that the problems also went away if I skipped the PDL instruction, so it wasn't just those list operations).
So it seems like it was an interaction: the PDL inner function led to large slow downs when I was updating a small list, adding to the total similarity, etc., on each step. The code below now settles to a constant 336 msecs per word, and so the whole set can be processed in about 3 hours.
I've also gotten some advice from the PDL mailing list about how to use vectorized processes to speed this up tremendously. I'll report back if I manage to get that working.
Thanks, everyone, for your help!
jim
#!/usr/bin/perl -s
use PDL;
use PDL::NiceSlice;
use Time::HiRes qw ( time ) ;
$|=1;
$top = 20;
$realStart = time();
while(<>){
chomp;
($wrd, @data) = split;
$kernel{$wrd} = norm(pdl(@data));
# EXAMPLE LINE
# word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
+ 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
}
@kernelKeys = sort( keys %kernel );
printf STDERR "# read $#kernelKeys words in %.2f seconds\n",
time()-$realStart;
$startAll = time();
$at1 = 0;
printf "#REC\ttheWord\tMEAN\tMAX\tmeanTOP$top\tTIME\n";
foreach $w1 (@kernelKeys) {
$startWord = time();
@allSims = ();
$at2 = -1;
foreach $w2 (@kernelKeys) {
$at2++;
next if($at1 == $at2); # skip identical item, but not homophones
push @allSims, inner($kernel{$w1},$kernel{$w2});
# $sim = inner($kernel{$w1},$kernel{w2});
# $totalsim+=$sim;
# if($sim > $maxsim){ $maxsim = $sim; }
# # keep the top 20
# if($#topX < $top){
# push @topX, $sim;
# } else {
# @topX = sort { $a <=> $b } @topX;
# if($sim > $topX[0]){ $topX[0] = $sim; }
# }
}
$at1++;
$allSim = qsort(pdl(@allSims));
$now = time();
printf "$at1\t$w1\t%.6f\t%.6f\t%.6f\t%.5f\n",
sum($allSim)/$#kernelKeys, max($allSim),
sum($allSim->(($#kernelKeys - $top - 1 - 1):($#kernelKeys - 1)))
+/$top,
$now - $startWord;
unless($at1 % 25) {
$elapsed = $now - $startAll;
$thisWord = $now - $startWord;
$perWord = $elapsed / ($at1 + 1);
$hoursRemaining = ($perWord * ($#kernelKeys - $at1 + 1))/3600;
printf STDERR "$at1\t$w1\t %.6f\tElapsed %.6f\tPerWord %.6f\tHours
+ToGo %.6f\n",
$thisWord, $elapsed, $perWord, $hoursRemaining;
}
}
|