Re: Mysterious slow down with large data set

Replies are listed 'Best First'.
Re^2: Mysterious slow down with large data set by jsmagnuson (Acolyte) on Feb 27, 2012 at 22:49 UTC
I stumbled on the source of the problem, but I do not understand the cause. In the code I pasted earlier, another wasteful thing I was doing was updating the topX list after each word. If instead I just save all the similarities and then use PDL functions to sort (and so find the topX, etc.), my problems go away (code below; and recall that the problems also went away if I skipped the PDL instruction, so it wasn't just those list operations). So it seems like it was an interaction: the PDL inner function led to large slow downs when I was updating a small list, adding to the total similarity, etc., on each step. The code below now settles to a constant 336 msecs per word, and so the whole set can be processed in about 3 hours. I've also gotten some advice from the PDL mailing list about how to use vectorized processes to speed this up tremendously. I'll report back if I manage to get that working. Thanks, everyone, for your help! jim #!/usr/bin/perl -s use PDL; use PDL::NiceSlice; use Time::HiRes qw ( time ) ; $\|=1; $top = 20; $realStart = time(); while(<>){ chomp; ($wrd, @data) = split; $kernel{$wrd} = norm(pdl(@data)); # EXAMPLE LINE # word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 + 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 } @kernelKeys = sort( keys %kernel ); printf STDERR "# read $#kernelKeys words in %.2f seconds\n", time()-$realStart; $startAll = time(); $at1 = 0; printf "#REC\ttheWord\tMEAN\tMAX\tmeanTOP$top\tTIME\n"; foreach $w1 (@kernelKeys) { $startWord = time(); @allSims = (); $at2 = -1; foreach $w2 (@kernelKeys) { $at2++; next if($at1 == $at2); # skip identical item, but not homophones push @allSims, inner($kernel{$w1},$kernel{$w2}); # $sim = inner($kernel{$w1},$kernel{w2}); # $totalsim+=$sim; # if($sim > $maxsim){ $maxsim = $sim; } # # keep the top 20 # if($#topX < $top){ # push @topX, $sim; # } else { # @topX = sort { $a <=> $b } @topX; # if($sim > $topX[0]){ $topX[0] = $sim; } # } } $at1++; $allSim = qsort(pdl(@allSims)); $now = time(); printf "$at1\t$w1\t%.6f\t%.6f\t%.6f\t%.5f\n", sum($allSim)/$#kernelKeys, max($allSim), sum($allSim->(($#kernelKeys - $top - 1 - 1):($#kernelKeys - 1))) +/$top, $now - $startWord; unless($at1 % 25) { $elapsed = $now - $startAll; $thisWord = $now - $startWord; $perWord = $elapsed / ($at1 + 1); $hoursRemaining = ($perWord * ($#kernelKeys - $at1 + 1))/3600; printf STDERR "$at1\t$w1\t %.6f\tElapsed %.6f\tPerWord %.6f\tHours +ToGo %.6f\n", $thisWord, $elapsed, $perWord, $hoursRemaining; } } [download]	[reply] [d/l]
Re^3: Mysterious slow down with large data set by BrowserUk (Patriarch) on Feb 27, 2012 at 23:16 UTC
o the whole set can be processed in about 3 hours. Glad you found a resolution, though 3 hours still isn't quick. But I wonder if you are open to having your methodology questioned? (By someone who has little knowledge of your goals and probably wouldn't understand them if he did :) Specifically I'd like to ask you about "word similarity"? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply]