Re: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 26, 2012 at 22:13 UTC
I can't see what is causing the slowdown, but I can see one obvious thing that would speed it up a lot. You keep re-sorting the keys of %kernel every time, even though they do not change. Instead of:
foreach $w1 ( sort( keys %kernel ) ){
    $totalsim = $maxsim = 0;
    @topX = ();
    $at2 = 0;
    foreach $w2 ( sort( keys %kernel ) ) {
        ...
Using:
my @sortedKeys = sort( keys %kernel );
foreach $w1 ( @sortedKeys ){
    $totalsim = $maxsim = 0;
    @topX = ();
    $at2 = 0;
    foreach $w2 ( @sortedKeys ) {
may speed things up to the point that the slowdown becomes insignificant.
Also, using a sort to track the top N is probably slower than a simple linear insertion, with a truncate when necessary.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
Yes, I can't believe I did that. Thank you!
Regarding the top N, I had previously tried this, which worked, but seemed more complicated. Is this what you had in mind?
} elsif ( $sim > min( pdl(@topX) ) ) {
    # find the index of the smallest entry ...
    my ($theMin) = grep { $topX[$_] == min( pdl(@topX) ) } 0 .. $#topX;
    # ... and replace it with this one
    $topX[$theMin] = $sim;
}
Thanks!
@topX = (-1) x 20;
...
$topX[ $_ ] < $sim and splice( @topX, $_, 0, $sim ), pop( @topX ), last for 0 .. 19;
A short-circuited, linear insertion is at worst O(N) rather than O(N log N).
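Written out longhand, the same linear insertion looks something like this (a sketch; like the one-liner, it assumes the real similarities never fall below the -1 seed value, which holds here since the normalised 0/1 vectors give non-negative inner products):
my @topX = ( -1 ) x 20;              # seeded below any real similarity

# then, for each new $sim inside the inner loop:
for my $i ( 0 .. $#topX ) {
    if ( $topX[$i] < $sim ) {
        splice @topX, $i, 0, $sim;   # insert $sim at its place in the descending list
        pop @topX;                   # drop what is now the 21st (smallest) entry
        last;                        # short-circuit: at most one insertion per $sim
    }
}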
It speeds things up a little, but doesn't address the slowdown, which is happening exclusively (and inexplicably) inside PDL.
Unfortunately, the PDL documentation spends more time telling you about its 'philosophy', and indexing the indexes to the documentation, than it does telling you what these functions actually do, or how to inspect the results of what they did :(
Re: Mysterious slow down with large data set
by chromatic (Archbishop) on Feb 26, 2012 at 22:16 UTC
I cannot figure out what would make the program slow down so much...
Unnecessary work:
foreach $w1 (sort(keys %kernel)){
    ...
    foreach $w2 (sort(keys %kernel)) { ... }
}
For every word in the hash, you sort all of the other words in the hash (even though you've already sorted them). Then you compare every word in the hash again, even those you've already compared.
Instead, sort the hash keys and assign them to an array. Then process the array one word at a time. Grab the first word, then compare it to the remaining words in the array. Then do the same for the second and so on. Note that the list of words to compare gets shorter with every word you process. This is exactly what you want.
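A minimal sketch of that shape, reusing %kernel and inner() from the original code (the @words name is mine):
my @words = sort keys %kernel;

for my $i ( 0 .. $#words - 1 ) {
    # compare word $i only against the words that follow it,
    # so the list of comparisons shrinks by one on each pass
    for my $j ( $i + 1 .. $#words ) {
        my $sim = inner( $kernel{ $words[$i] }, $kernel{ $words[$j] } );
        # ... accumulate whatever statistics are needed ...
    }
}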
Thanks for pointing out the unnecessary work.
Can I ask you about the idea that I can decrease the size of the list with each word? The problem is that I need to get the total similarity of each word to every other word. So I don't think I can decrease the number of comparisons per step without storing the preceding results, and the memory demands for that are huge, even if I delete items from memory once they are retrieved.
But if you see another solution, please let me know. Thank you for your help!
I don't know whether your comparison of two words is direction-dependent, that is, whether comparing word1 with word2 gives the same result as comparing word2 with word1. If it is direction-dependent, keep your algorithm; if it is not, follow chromatic's tip: compare the first element to all elements in the array but the first one (the element itself), then compare the second element to all the others but the first two, because you have already compared the first to the second, and so on. That is what chromatic was pointing to.
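For what it's worth, inner() is just a dot product, so it is symmetric; in that case each pair only needs to be computed once and can be credited to both words, and the per-word totals still come out right without storing every pairwise result. A rough sketch (@words, %totalsim and %maxsim are names I've made up):
my @words = sort keys %kernel;
my ( %totalsim, %maxsim );
$totalsim{$_} = 0 for @words;

for my $i ( 0 .. $#words - 1 ) {
    for my $j ( $i + 1 .. $#words ) {
        # sclr() turns the 0-dim result piddle into a plain Perl number
        my $sim = inner( $kernel{ $words[$i] }, $kernel{ $words[$j] } )->sclr;
        for my $w ( $words[$i], $words[$j] ) {
            $totalsim{$w} += $sim;
            $maxsim{$w} = $sim if !defined $maxsim{$w} or $sim > $maxsim{$w};
        }
    }
}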
Re: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 26, 2012 at 22:58 UTC
The slowdown is being caused by the PDL calculation. You can prove this by substituting $sim = rand() for that calculation and seeing that it then runs at a constant 0.023s per iteration, whereas the PDL calculation takes an almost identical time for the first iteration but slows to 0.214s by the time you get to the 500th.
My understanding of PDL is limited, but from a cursory inspection, you are performing the same calculation on what looks to be the same size piddles at each iteration, so there is no logic to it getting slower.
Unless, of course, the length or form of the piddles is being changed each time you reuse them. What do inner() and norm() do to the underlying piddles? If they are growing (possibly doubling?) in length each time you use them, it would explain the progression of the slowdown.
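For what it's worth, PDL can tell you: info(), ndims() and nelem() report a piddle's type, shape and element count, so printing them for $kernel{$w1} and $kernel{$w2} inside the loop would show whether anything is growing between iterations. A throwaway example:
use PDL;

my $v = norm( pdl( 0, 1, 0, 1 ) );
print $v->info, "\n";                                    # type and dimensions
print $v->ndims, " dim(s), ", $v->nelem, " elements\n";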
Re: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 27, 2012 at 01:42 UTC
You can shed another chunk of time by norm()'ing your vectors once as you load them, rather than repeatedly:
while (<>) {
    chomp;
    my ($wrd, @data) = split;
    $kernel{$wrd} = norm( pdl(@data) );
...
my $sim = inner( $kernel{ $w1 }, $kernel{ $w2 } );
But the PDL code -- now down to just the inner()s -- still exhibits the slowdown behaviour, even though it doesn't do so when run in a tight loop on two similar vectors in a standalone test.
At this point I think you need some serious PDL knowledge to move forward.
Re: Mysterious slow down with large data set
by jwkrahn (Abbot) on Feb 26, 2012 at 23:08 UTC
Not related to speed, but if you had warnings enabled you would get this message:
Name "main::startWord" used only once: possible typo at yourprogram line 42.
Which means that this line:
42 $thisWord = $now - $startWord;
Is actually processed as:
42 $thisWord = $now;
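For reference, turning those diagnostics on only takes the usual pragmas at the top of the script (a sketch; note that use strict would additionally require declaring the variables with my, which is how it catches misspelled names at compile time):
#!/usr/bin/perl -s
use warnings;   # emits "used only once: possible typo" and similar diagnostics
use strict;     # refuses to compile undeclared (e.g. misspelled) variable names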
Also, Perl is not C: printf called with nothing but a single interpolated string treats that string as the format, so any % characters in the data would be misinterpreted. This line:
38 printf("$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\n");
Should be:
38 print "$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\n";
And these lines:
45 printf(STDERR "#$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\t".
46 "ELAPSED %.3f THISWORD %.3f PERWORD %.3f HOURStoGO %.3f\n",
47 $elapsed, $thisWord, $perWord, $hoursRemaining);
Should be:
45 print STDERR "#$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\t";
46 printf STDERR "ELAPSED %.3f THISWORD %.3f PERWORD %.3f HOURStoGO %.3f\n",
47 $elapsed, $thisWord, $perWord, $hoursRemaining;
Re: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 27, 2012 at 02:03 UTC
In this snippet from your code:
next if($at == $at2); # skip identical item, but not homophones
$at2++;
the increment comes after the test, so the first time $at2 catches up with $at the next fires, $at2 never advances again, and every remaining $w2 in the inner loop is skipped. The $at2++ needs to happen before the test.
Re: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 27, 2012 at 18:23 UTC
BTW: once you've fixed your code to perform the complete processing you intend, if you find the indication that it is going to take 50+ hours depressing, then come back.
Because whilst the calculation wouldn't be exactly as you have it now, it might be possible to achieve your goal far more quickly.
I stumbled on the source of the problem, but I do not understand the cause. In the code I pasted earlier, another wasteful thing I was doing was updating the topX list after each word. If instead I just save all the similarities and then use PDL functions to sort (and so find the topX, etc.), my problems go away (code below; and recall that the problems also went away if I skipped the PDL instruction, so it wasn't just those list operations).
So it seems like it was an interaction: the PDL inner function led to large slowdowns when I was updating a small list, adding to the total similarity, etc., on each step. The code below now settles to a constant 336 msecs per word, so the whole set can be processed in about 3 hours.
I've also gotten some advice from the PDL mailing list about how to use vectorized operations to speed this up tremendously (a rough sketch of that approach appears after the code below). I'll report back if I manage to get that working.
Thanks, everyone, for your help!
jim
#!/usr/bin/perl -s
use PDL;
use PDL::NiceSlice;
use Time::HiRes qw ( time ) ;
$|=1;
$top = 20;
$realStart = time();
while (<>) {
    chomp;
    ($wrd, @data) = split;
    $kernel{$wrd} = norm( pdl(@data) );
    # EXAMPLE INPUT LINE (a word followed by several hundred space-separated 0/1 values):
    # word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
}
@kernelKeys = sort( keys %kernel );
printf STDERR "# read $#kernelKeys words in %.2f seconds\n",
time()-$realStart;
$startAll = time();
$at1 = 0;
printf "#REC\ttheWord\tMEAN\tMAX\tmeanTOP$top\tTIME\n";
foreach $w1 (@kernelKeys) {
    $startWord = time();
    @allSims = ();
    $at2 = -1;
    foreach $w2 (@kernelKeys) {
        $at2++;
        next if($at1 == $at2); # skip identical item, but not homophones
        push @allSims, inner($kernel{$w1}, $kernel{$w2});
        # $sim = inner($kernel{$w1}, $kernel{w2});
        # $totalsim += $sim;
        # if($sim > $maxsim){ $maxsim = $sim; }
        # # keep the top 20
        # if($#topX < $top){
        #     push @topX, $sim;
        # } else {
        #     @topX = sort { $a <=> $b } @topX;
        #     if($sim > $topX[0]){ $topX[0] = $sim; }
        # }
    }
    $at1++;
    $allSim = qsort(pdl(@allSims));
    $now = time();
    printf "$at1\t$w1\t%.6f\t%.6f\t%.6f\t%.5f\n",
        sum($allSim)/$#kernelKeys, max($allSim),
        sum($allSim->(($#kernelKeys - $top - 1 - 1):($#kernelKeys - 1)))/$top,
        $now - $startWord;
    unless($at1 % 25) {
        $elapsed = $now - $startAll;
        $thisWord = $now - $startWord;
        $perWord = $elapsed / ($at1 + 1);
        $hoursRemaining = ($perWord * ($#kernelKeys - $at1 + 1))/3600;
        printf STDERR "$at1\t$w1\t %.6f\tElapsed %.6f\tPerWord %.6f\tHoursToGo %.6f\n",
            $thisWord, $elapsed, $perWord, $hoursRemaining;
    }
}
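For reference, the vectorized approach mentioned above might look roughly like this: stack the normalised vectors into one 2-D piddle with cat() and let inner()'s threading produce all of one word's similarities in a single call. This is a sketch only, not the mailing-list code, and untested against this data:
use PDL;
use PDL::NiceSlice;

# assumes %kernel and @kernelKeys built exactly as in the script above
my $M = cat( @kernel{ @kernelKeys } );       # dims: (veclen, nwords)

for my $i ( 0 .. $#kernelKeys ) {
    # one threaded inner() call gives word $i's similarity to every word
    my $sims = inner( $M, $kernel{ $kernelKeys[$i] } );
    # blank out the self-similarity; harmless here because the 0/1 vectors
    # give non-negative similarities
    $sims->(($i)) .= 0;
    my $sorted = qsort $sims;
    # sum($sorted), max($sorted) and a slice of its top $top entries then
    # replace the per-pair bookkeeping in the inner loop
}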
"so the whole set can be processed in about 3 hours."
Glad you found a resolution, though 3 hours still isn't quick.
But I wonder if you are open to having your methodology questioned? (By someone who has little knowledge of your goals and probably wouldn't understand them if he did :)
Specifically I'd like to ask you about "word similarity"?