I stumbled on the source of the problem, but I do not understand the cause. In the code I pasted earlier, another wasteful thing I was doing was updating the topX list after each word. If instead I just save all the similarities and then use PDL functions to sort (and so find the topX, etc.), my problems go away (code below; and recall that the problems also went away if I skipped the PDL instruction, so it wasn't just those list operations).

So it seems like it was an interaction: the PDL inner function led to large slow downs when I was updating a small list, adding to the total similarity, etc., on each step. The code below now settles to a constant 336 msecs per word, and so the whole set can be processed in about 3 hours.

I've also gotten some advice from the PDL mailing list about how to use vectorized processes to speed this up tremendously. I'll report back if I manage to get that working.

Thanks, everyone, for your help!

jim
#!/usr/bin/perl -s use PDL; use PDL::NiceSlice; use Time::HiRes qw ( time ) ; $|=1; $top = 20; $realStart = time(); while(<>){ chomp; ($wrd, @data) = split; $kernel{$wrd} = norm(pdl(@data)); # EXAMPLE LINE # word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 + 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 +0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 } @kernelKeys = sort( keys %kernel ); printf STDERR "# read $#kernelKeys words in %.2f seconds\n", time()-$realStart; $startAll = time(); $at1 = 0; printf "#REC\ttheWord\tMEAN\tMAX\tmeanTOP$top\tTIME\n"; foreach $w1 (@kernelKeys) { $startWord = time(); @allSims = (); $at2 = -1; foreach $w2 (@kernelKeys) { $at2++; next if($at1 == $at2); # skip identical item, but not homophones push @allSims, inner($kernel{$w1},$kernel{$w2}); # $sim = inner($kernel{$w1},$kernel{w2}); # $totalsim+=$sim; # if($sim > $maxsim){ $maxsim = $sim; } # # keep the top 20 # if($#topX < $top){ # push @topX, $sim; # } else { # @topX = sort { $a <=> $b } @topX; # if($sim > $topX[0]){ $topX[0] = $sim; } # } } $at1++; $allSim = qsort(pdl(@allSims)); $now = time(); printf "$at1\t$w1\t%.6f\t%.6f\t%.6f\t%.5f\n", sum($allSim)/$#kernelKeys, max($allSim), sum($allSim->(($#kernelKeys - $top - 1 - 1):($#kernelKeys - 1))) +/$top, $now - $startWord; unless($at1 % 25) { $elapsed = $now - $startAll; $thisWord = $now - $startWord; $perWord = $elapsed / ($at1 + 1); $hoursRemaining = ($perWord * ($#kernelKeys - $at1 + 1))/3600; printf STDERR "$at1\t$w1\t %.6f\tElapsed %.6f\tPerWord %.6f\tHours +ToGo %.6f\n", $thisWord, $elapsed, $perWord, $hoursRemaining; } }

In reply to Re^2: Mysterious slow down with large data set by jsmagnuson
in thread Mysterious slow down with large data set by jsmagnuson

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.