comment on

I stumbled on the source of the problem, but I do not understand the cause. In the code I pasted earlier, another wasteful thing I was doing was updating the topX list after each word. If instead I just save all the similarities and then use PDL functions to sort (and so find the topX, etc.), my problems go away (code below; and recall that the problems also went away if I skipped the PDL instruction, so it wasn't just those list operations).

So it seems like it was an interaction: the PDL inner function led to large slow downs when I was updating a small list, adding to the total similarity, etc., on each step. The code below now settles to a constant 336 msecs per word, and so the whole set can be processed in about 3 hours.

I've also gotten some advice from the PDL mailing list about how to use vectorized processes to speed this up tremendously. I'll report back if I manage to get that working.

Thanks, everyone, for your help!

jim


#!/usr/bin/perl -s
use PDL;
use PDL::NiceSlice;
use Time::HiRes qw ( time ) ;
$|=1;
$top = 20;

$realStart = time();

while(<>){ 
    chomp;
    ($wrd, @data) = split;
    $kernel{$wrd} = norm(pdl(@data));
    # EXAMPLE LINE
    # word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0
+ 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    
}

@kernelKeys = sort( keys %kernel );
printf STDERR "# read $#kernelKeys words in %.2f seconds\n",
  time()-$realStart;


$startAll = time();

$at1 = 0;
printf "#REC\ttheWord\tMEAN\tMAX\tmeanTOP$top\tTIME\n";

foreach $w1 (@kernelKeys) {
  $startWord = time();
  @allSims = ();
  $at2 = -1;
  foreach $w2 (@kernelKeys) {
    $at2++;
    next if($at1 == $at2); # skip identical item, but not homophones
    push @allSims, inner($kernel{$w1},$kernel{$w2});
#    $sim = inner($kernel{$w1},$kernel{w2});
#    $totalsim+=$sim;
#    if($sim > $maxsim){      $maxsim = $sim;    }
#    # keep the top 20
#    if($#topX < $top){
#      push @topX, $sim;
#    } else { 
#      @topX = sort { $a <=> $b } @topX;
#      if($sim > $topX[0]){    $topX[0] = $sim;      }
#    }
  }
  $at1++;

  
  $allSim = qsort(pdl(@allSims));
  $now = time();
  printf "$at1\t$w1\t%.6f\t%.6f\t%.6f\t%.5f\n",
    sum($allSim)/$#kernelKeys, max($allSim), 
      sum($allSim->(($#kernelKeys - $top - 1 - 1):($#kernelKeys - 1)))
+/$top, 
      $now - $startWord;
  unless($at1 % 25) {
    $elapsed = $now - $startAll;
    $thisWord = $now - $startWord;
    $perWord = $elapsed / ($at1 + 1);
    $hoursRemaining = ($perWord * ($#kernelKeys - $at1 + 1))/3600;
    printf STDERR "$at1\t$w1\t %.6f\tElapsed %.6f\tPerWord %.6f\tHours
+ToGo %.6f\n",
      $thisWord, $elapsed, $perWord, $hoursRemaining;
  }
}
[download]

In reply to Re^2: Mysterious slow down with large data set by jsmagnuson
in thread Mysterious slow down with large data set by jsmagnuson

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.