in reply to Re: Mysterious slow down with large data set
in thread Mysterious slow down with large data set
I stumbled on the source of the problem, though I still don't understand the cause. In the code I pasted earlier, another wasteful thing I was doing was updating the topX list after each word. If instead I just save all the similarities and then use PDL functions to sort them (and so find the topX, etc.), my problems go away (code below; and recall that the problems also went away if I skipped the PDL inner() call, so it wasn't just those list operations).
So it seems to have been an interaction: the PDL inner() function led to large slowdowns when I was also updating a small Perl list, adding to the total similarity, and so on, at each step. The code below now settles to a constant 336 msec per word, so the whole set can be processed in about 3 hours.
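To isolate the change: instead of maintaining a running max, total, and top-N Perl array inside the inner loop, every similarity gets pushed onto one Perl array and PDL does the sorting once at the end. A minimal self-contained version of that pattern (with made-up random data, not the real word vectors) looks like this:

```perl
#!/usr/bin/perl
# Minimal illustration of the "collect everything, sort once with PDL"
# pattern described above; the data here is random, not the real vectors.
use strict;
use warnings;
use PDL;

my $top = 5;

# Fake kernel: 200 normalized 50-dimensional vectors keyed by word.
my %kernel     = map { ("w$_" => norm(random(50))) } 1 .. 200;
my @kernelKeys = sort keys %kernel;
my $w1         = $kernelKeys[0];

# Push every similarity; no per-item bookkeeping inside the loop.
my @allSims;
foreach my $w2 (@kernelKeys) {
    next if $w2 eq $w1;                       # skip the word itself
    push @allSims, inner($kernel{$w1}, $kernel{$w2});
}

# One PDL sort at the end gives mean, max, and the top-N slice cheaply.
my $allSim = qsort(pdl(@allSims));            # ascending
my $n      = $allSim->nelem;
printf "mean %.6f  max %.6f  meanTop%d %.6f\n",
    avg($allSim),
    max($allSim),
    $top,
    avg($allSim->slice(($n - $top) . ':' . ($n - 1)));
```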
I've also gotten some advice from the PDL mailing list about how to use vectorized processes to speed this up tremendously. I'll report back if I manage to get that working.
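For anyone curious, here is one way such vectorization could look (a rough sketch with made-up data and dimensions, not necessarily what the mailing list suggested): keep every normalized vector as a row of a single 2-D piddle and let PDL's threaded inner() compute all of a word's similarities in one call, so the inner Perl loop disappears.

```perl
#!/usr/bin/perl
# Sketch only: fake data, and not necessarily the approach the PDL list
# suggested. All normalized word vectors live as rows of one 2-D piddle,
# and inner() threads over the rows, so each word needs a single PDL call
# instead of an inner Perl loop.
use strict;
use warnings;
use PDL;

my $top = 20;

# Fake data: 1000 words, 600-dimensional random vectors, one row per word.
my @words = map { "w$_" } 1 .. 1000;
my $mat   = norm(random(600, 1000));      # dims (600, 1000), each row normalized

foreach my $i (0 .. $#words) {
    my $vec  = $mat->slice(":,($i)");     # word $i's vector, dims (600)
    my $sims = inner($mat, $vec);         # threaded: all 1000 similarities at once

    $sims->set($i, 0);                    # drop the word's similarity to itself
    my $sorted = qsort($sims);            # ascending
    my $n      = $sorted->nelem;

    printf "%s\tmean %.6f\tmax %.6f\tmeanTop%d %.6f\n",
        $words[$i],
        sum($sorted) / ($n - 1),          # mean over the other words
        max($sorted),
        $top,
        avg($sorted->slice(($n - $top) . ':' . ($n - 1)));
}
```

The per-word matrix-style call already moves the O(n) inner loop from Perl into C; for the full data set the same idea could presumably be applied to whole chunks of rows at a time.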
Thanks, everyone, for your help!
jim

```perl
#!/usr/bin/perl -s

use PDL;
use PDL::NiceSlice;
use Time::HiRes qw( time );

$| = 1;

$top = 20;

$realStart = time();
while (<>) {
    chomp;
    ($wrd, @data) = split;
    $kernel{$wrd} = norm(pdl(@data));
    # EXAMPLE INPUT LINE (one word followed by its full feature vector,
    # several hundred 0/1 values, all on a single line):
    # word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
}

@kernelKeys = sort( keys %kernel );

printf STDERR "# read $#kernelKeys words in %.2f seconds\n", time() - $realStart;

$startAll = time();
$at1 = 0;

printf "#REC\ttheWord\tMEAN\tMAX\tmeanTOP$top\tTIME\n";

foreach $w1 (@kernelKeys) {
    $startWord = time();
    @allSims = ();
    $at2 = -1;
    foreach $w2 (@kernelKeys) {
        $at2++;
        next if ($at1 == $at2);   # skip identical item, but not homophones
        push @allSims, inner($kernel{$w1}, $kernel{$w2});
        # $sim = inner($kernel{$w1}, $kernel{$w2});
        # $totalsim += $sim;
        # if ($sim > $maxsim) { $maxsim = $sim; }
        # # keep the top 20
        # if ($#topX < $top) {
        #     push @topX, $sim;
        # } else {
        #     @topX = sort { $a <=> $b } @topX;
        #     if ($sim > $topX[0]) { $topX[0] = $sim; }
        # }
    }
    $at1++;
    $allSim = qsort(pdl(@allSims));
    $now = time();
    printf "$at1\t$w1\t%.6f\t%.6f\t%.6f\t%.5f\n",
        sum($allSim) / $#kernelKeys,
        max($allSim),
        sum($allSim->(($#kernelKeys - $top - 1 - 1):($#kernelKeys - 1))) / $top,
        $now - $startWord;
    unless ($at1 % 25) {
        $elapsed        = $now - $startAll;
        $thisWord       = $now - $startWord;
        $perWord        = $elapsed / ($at1 + 1);
        $hoursRemaining = ($perWord * ($#kernelKeys - $at1 + 1)) / 3600;
        printf STDERR "$at1\t$w1\t %.6f\tElapsed %.6f\tPerWord %.6f\tHoursToGo %.6f\n",
            $thisWord, $elapsed, $perWord, $hoursRemaining;
    }
}
```
Re^3: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 27, 2012 at 23:16 UTC