in reply to Mysterious slow down with large data set

BTW: Once you've fixed your code to perform the complete processing you intend, if you find the indication that it will take 50+ hours to process depressing, then come back.

Because whilst the calculation wouldn't be exactly as you have it now, it might be possible to achieve your goal far more quickly.



Re^2: Mysterious slow down with large data set
by jsmagnuson (Acolyte) on Feb 27, 2012 at 22:49 UTC

    I stumbled on the source of the problem, but I do not understand the cause. In the code I pasted earlier, another wasteful thing I was doing was updating the topX list after each word comparison. If instead I just save all the similarities and then use PDL functions to sort them (and so find the topX, etc.), my problems go away (code below; recall that the problems also went away when I skipped the PDL instruction, so it wasn't just those list operations).

    So it seems like it was an interaction: the PDL inner function led to large slowdowns when I was also updating a small list, adding to the total similarity, etc., on each step. The code below now settles to a constant 336 msec per word, so the whole set can be processed in about 3 hours.
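
    (For scale: at 0.336 s per word, the 3-hour figure, roughly 10,800 s, implies on the order of 10,800 / 0.336 ≈ 32,000 words in the set, so the total run time is essentially that constant per-word cost times the vocabulary size.)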

    I've also gotten some advice from the PDL mailing list about how to use vectorized processes to speed this up tremendously. I'll report back if I manage to get that working.
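
    Roughly, the idea (my reading of it, so treat the details as an assumption rather than the list's exact advice) is to stack all the normalized vectors into one 2-D piddle and let a single threaded inner() call compute a whole row of similarities at once, instead of calling inner() once per word pair from Perl. A minimal sketch, assuming %kernel, @kernelKeys and $top are built exactly as in the script below:

    use PDL;

    # Stack the N normalized word vectors (each a 1-D piddle of length D)
    # into a single piddle of dims (D, N): one word per row.
    my $mat = cat( @kernel{ @kernelKeys } );

    for my $i ( 0 .. $#kernelKeys ) {
        # One threaded inner-product call yields all N similarities for word $i.
        my $sims = inner( $mat, $kernel{ $kernelKeys[$i] } );

        # The loop version skips the self-comparison; here we just zero it,
        # so it contributes nothing to the sum and sorts to the bottom.
        $sims->set( $i, 0 );

        my $sorted  = qsort( $sims );                       # ascending
        my $mean    = sum( $sorted ) / $#kernelKeys;        # mean over the N-1 real pairs
        my $max     = max( $sorted );                       # largest similarity
        my $topMean = avg( $sorted->slice( "-$top:-1" ) );  # mean of the $top largest

        printf "%d\t%s\t%.6f\t%.6f\t%.6f\n",
            $i + 1, $kernelKeys[$i], $mean, $max, $topMean;
    }

    This keeps the outer Perl loop rather than building the full N x N similarity matrix in one shot (e.g. with matmult), because for tens of thousands of words that matrix would run to many gigabytes; the single threaded call already removes the expensive inner Perl loop, which is where the time was going.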

    Thanks, everyone, for your help!

    jim
    #!/usr/bin/perl -s

    use PDL;
    use PDL::NiceSlice;
    use Time::HiRes qw( time );

    $| = 1;
    $top = 20;

    $realStart = time();
    while (<>) {
        chomp;
        ($wrd, @data) = split;
        $kernel{$wrd} = norm(pdl(@data));
        # EXAMPLE INPUT LINE: one word followed by several hundred
        # whitespace-separated 0/1 feature values (mostly zeros).
    }
    @kernelKeys = sort( keys %kernel );
    printf STDERR "# read $#kernelKeys words in %.2f seconds\n", time() - $realStart;

    $startAll = time();
    $at1 = 0;
    printf "#REC\ttheWord\tMEAN\tMAX\tmeanTOP$top\tTIME\n";

    foreach $w1 (@kernelKeys) {
        $startWord = time();
        @allSims = ();
        $at2 = -1;
        foreach $w2 (@kernelKeys) {
            $at2++;
            next if ($at1 == $at2);    # skip identical item, but not homophones
            push @allSims, inner($kernel{$w1}, $kernel{$w2});
            # $sim = inner($kernel{$w1}, $kernel{$w2});
            # $totalsim += $sim;
            # if ($sim > $maxsim) { $maxsim = $sim; }
            # # keep the top 20
            # if ($#topX < $top) {
            #     push @topX, $sim;
            # } else {
            #     @topX = sort { $a <=> $b } @topX;
            #     if ($sim > $topX[0]) { $topX[0] = $sim; }
            # }
        }
        $at1++;
        $allSim = qsort(pdl(@allSims));
        $now = time();
        printf "$at1\t$w1\t%.6f\t%.6f\t%.6f\t%.5f\n",
            sum($allSim) / $#kernelKeys,
            max($allSim),
            sum($allSim->(($#kernelKeys - $top - 1 - 1):($#kernelKeys - 1))) / $top,
            $now - $startWord;
        unless ($at1 % 25) {
            $elapsed        = $now - $startAll;
            $thisWord       = $now - $startWord;
            $perWord        = $elapsed / ($at1 + 1);
            $hoursRemaining = ($perWord * ($#kernelKeys - $at1 + 1)) / 3600;
            printf STDERR "$at1\t$w1\t%.6f\tElapsed %.6f\tPerWord %.6f\tHoursToGo %.6f\n",
                $thisWord, $elapsed, $perWord, $hoursRemaining;
        }
    }
      "the whole set can be processed in about 3 hours."

      Glad you found a resolution, though 3 hours still isn't quick.

      But I wonder if you are open to having your methodology questioned? (By someone who has little knowledge of your goals and probably wouldn't understand them if he did :)

      Specifically, I'd like to ask you about "word similarity".

