Re: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 26, 2012 at 22:13 UTC
I can't see what is causing the slowdown, but I can see one obvious thing that would speed it up a lot. You keep re-sorting the keys of %kernel every time, even though they do not change. Instead of:
foreach $w1 ( sort( keys %kernel ) ){
    $totalsim = $maxsim = 0;
    @topX = ();
    $at2 = 0;
    foreach $w2 ( sort( keys %kernel ) ) {
        ...
Using:
my @sortedKeys = sort( keys %kernel );
foreach $w1 ( @sortedKeys ){
    $totalsim = $maxsim = 0;
    @topX = ();
    $at2 = 0;
    foreach $w2 ( @sortedKeys ) {
may speed things up to the point that the slowdown becomes insignificant.
Also, using a sort to track the top N is probably slower than a simple linear insertion, with a truncate when necessary.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
Yes, I can't believe I did that. Thank you!
Regarding the top N, I had previously tried this, which worked, but seemed more complicated. Is this what you had in mind?
} elsif ( $sim > min( pdl(@topX) ) ) {
    # find the index of the smallest entry ...
    my ($theMin) = grep { $topX[$_] == min( pdl(@topX) ) } 0 .. $#topX;
    # ... and replace it with this one
    $topX[$theMin] = $sim;
}
Thanks!
@topX = (-1) x 20;
...
$topX[ $_ ] < $sim and splice( @topX, $_, 0, $sim ), pop( @topX ), last for 0 .. 19;
A short-circuited, linear insertion is at worst O(N) rather than O(N log N).
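Written out longhand, the same linear insertion looks something like this (a sketch; like the one-liner, it assumes the real similarities never fall below the -1 seed value, which holds here since the normalised 0/1 vectors give non-negative inner products):
my @topX = ( -1 ) x 20;              # seeded below any real similarity

# then, for each new $sim inside the inner loop:
for my $i ( 0 .. $#topX ) {
    if ( $topX[$i] < $sim ) {
        splice @topX, $i, 0, $sim;   # insert $sim at its place in the descending list
        pop @topX;                   # drop what is now the 21st (smallest) entry
        last;                        # short-circuit: at most one insertion per $sim
    }
}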
It speeds things up a little, but doesn't address the slowdown, which is happening exclusively (and inexplicably) inside PDL.
Unfortunately, the PDL documentation spends more time telling you about its 'philosophy', and indexing the indexes to the documentation, than it does telling you what these functions actually do, or how to inspect the results of what they did :(
Re: Mysterious slow down with large data set
by chromatic (Archbishop) on Feb 26, 2012 at 22:16 UTC
I cannot figure out what would make the program slow down so much...
Unnecessary work:
foreach $w1 (sort(keys %kernel)){
    ...
    foreach $w2 (sort(keys %kernel)) { ... }
}
For every word in the hash, you sort all of the other words in the hash (even though you've already sorted them). Then you compare every word in the hash again, even those you've already compared.
Instead, sort the hash keys and assign them to an array. Then process the array one word at a time. Grab the first word, then compare it to the remaining words in the array. Then do the same for the second and so on. Note that the list of words to compare gets shorter with every word you process. This is exactly what you want.
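A minimal sketch of that shape, reusing %kernel and inner() from the original code (the @words name is mine):
my @words = sort keys %kernel;

for my $i ( 0 .. $#words - 1 ) {
    # compare word $i only against the words that follow it,
    # so the list of comparisons shrinks by one on each pass
    for my $j ( $i + 1 .. $#words ) {
        my $sim = inner( $kernel{ $words[$i] }, $kernel{ $words[$j] } );
        # ... accumulate whatever statistics are needed ...
    }
}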
Thanks for pointing out the unnecessary work.
Can I ask you about the idea that I can decrease the size of the list with each word? The problem is that I need to get the total similarity of each word to every other word. So I don't think I can decrease the number of comparisons per step without storing the preceding results, and the memory demands for that are huge, even if I delete items from memory once they are retrieved.
But if you see another solution, please let me know. Thank you for your help!
I don't know whether your comparison of two words is direction-dependent, that is, whether comparing word1 with word2 gives the same result as comparing word2 with word1. If it is direction-dependent, keep your algorithm; if it is not, follow chromatic's tip: compare the first element to all elements in the array but the first one (the element itself), then compare the second element to all the others but the first two, because you have already compared the first to the second, and so on. That is what chromatic was pointing to.
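For what it's worth, inner() is just a dot product, so it is symmetric; in that case each pair only needs to be computed once and can be credited to both words, and the per-word totals still come out right without storing every pairwise result. A rough sketch (@words, %totalsim and %maxsim are names I've made up):
my @words = sort keys %kernel;
my ( %totalsim, %maxsim );
$totalsim{$_} = 0 for @words;

for my $i ( 0 .. $#words - 1 ) {
    for my $j ( $i + 1 .. $#words ) {
        # sclr() turns the 0-dim result piddle into a plain Perl number
        my $sim = inner( $kernel{ $words[$i] }, $kernel{ $words[$j] } )->sclr;
        for my $w ( $words[$i], $words[$j] ) {
            $totalsim{$w} += $sim;
            $maxsim{$w} = $sim if !defined $maxsim{$w} or $sim > $maxsim{$w};
        }
    }
}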
Re: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 26, 2012 at 22:58 UTC
The slowdown is being caused by the PDL calculation. You can prove this by substituting $sim = rand() for that calculation and seeing that it then runs at a constant 0.023s per iteration, whereas the PDL calculation takes an almost identical time for the first iteration but slows to 0.214s by the time you get to the 500th.
My understanding of PDL is limited, but from a cursory inspection, you are performing the same calculation on what looks to be the same size piddles at each iteration, so there is no logic to it getting slower.
Unless, of course, the length or form of the piddles is being changed each time you reuse them. What do inner() and norm() do to the underlying piddles? If they are growing (possibly doubling?) in length each time you use them, it would explain the progression of the slowdown.
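For what it's worth, PDL can tell you: info(), ndims() and nelem() report a piddle's type, shape and element count, so printing them for $kernel{$w1} and $kernel{$w2} inside the loop would show whether anything is growing between iterations. A throwaway example:
use PDL;

my $v = norm( pdl( 0, 1, 0, 1 ) );
print $v->info, "\n";                                    # type and dimensions
print $v->ndims, " dim(s), ", $v->nelem, " elements\n";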
Re: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 27, 2012 at 01:42 UTC
You can shed another chunk of time by norm()'ing your vectors once as you load them, rather than repeatedly:
while (<>) {
    chomp;
    my ($wrd, @data) = split;
    $kernel{$wrd} = norm( pdl(@data) );
...
my $sim = inner( $kernel{ $w1 }, $kernel{ $w2 } );
But the PDL code -- now down to just the inner()s -- still exhibits the slowdown behaviour, even though it doesn't do so when run in a tight loop on two similar vectors in a standalone test.
At this point I think you need some serious PDL knowledge to move forward.
Re: Mysterious slow down with large data set
by jwkrahn (Abbot) on Feb 26, 2012 at 23:08 UTC
Not related to speed, but if you had warnings enabled you would get this message:
Name "main::startWord" used only once: possible typo at yourprogram line 42.
Which means that this line:
42 $thisWord = $now - $startWord;
Is actually processed as:
42 $thisWord = $now;
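For reference, turning those diagnostics on only takes the usual pragmas at the top of the script (a sketch; note that use strict would additionally require declaring the variables with my, which is how it catches misspelled names at compile time):
#!/usr/bin/perl -s
use warnings;   # emits "used only once: possible typo" and similar diagnostics
use strict;     # refuses to compile undeclared (e.g. misspelled) variable names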
Also, Perl is not C: printf called with nothing but a single interpolated string treats that string as the format, so any % characters in the data would be misinterpreted. This line:
38 printf("$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\n");
Should be:
38 print "$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\n";
And these lines:
45 printf(STDERR "#$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\t".
46 "ELAPSED %.3f THISWORD %.3f PERWORD %.3f HOURStoGO %.3f\n",
47 $elapsed, $thisWord, $perWord, $hoursRemaining);
Should be:
45 print STDERR "#$at\t$w1\t$totalsim\t$maxsim\t$topXtotal\t";
46 printf STDERR "ELAPSED %.3f THISWORD %.3f PERWORD %.3f HOURStoGO %.3f\n",
47 $elapsed, $thisWord, $perWord, $hoursRemaining;
Re: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 27, 2012 at 02:03 UTC
In this snippet from your code:
next if($at == $at2); # skip identical item, but not homophones
$at2++;
the increment comes after the test, so the first time $at2 catches up with $at the next fires, $at2 never advances again, and every remaining $w2 in the inner loop is skipped. The $at2++ needs to happen before the test.
Re: Mysterious slow down with large data set
by BrowserUk (Patriarch) on Feb 27, 2012 at 18:23 UTC
BTW: once you've fixed your code to perform the complete processing you intend, if you find the indication that it is going to take 50+ hours depressing, then come back.
Because whilst the calculation wouldn't be exactly as you have it now, it might be possible to achieve your goal far more quickly.
I stumbled on the source of the problem, but I do not understand the cause. In the code I pasted earlier, another wasteful thing I was doing was updating the topX list after each word. If instead I just save all the similarities and then use PDL functions to sort (and so find the topX, etc.), my problems go away (code below; and recall that the problems also went away if I skipped the PDL instruction, so it wasn't just those list operations).
So it seems like it was an interaction: the PDL inner function led to large slowdowns when I was updating a small list, adding to the total similarity, etc., on each step. The code below now settles to a constant 336 msecs per word, so the whole set can be processed in about 3 hours.
I've also gotten some advice from the PDL mailing list about how to use vectorized operations to speed this up tremendously (a rough sketch of that approach appears after the code below). I'll report back if I manage to get that working.
Thanks, everyone, for your help!
jim
#!/usr/bin/perl -s
use PDL;
use PDL::NiceSlice;
use Time::HiRes qw ( time ) ;
$|=1;
$top = 20;
$realStart = time();
while (<>) {
    chomp;
    ($wrd, @data) = split;
    $kernel{$wrd} = norm( pdl(@data) );
    # EXAMPLE INPUT LINE (a word followed by several hundred space-separated 0/1 values):
    # word 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
}
@kernelKeys = sort( keys %kernel );
printf STDERR "# read $#kernelKeys words in %.2f seconds\n",
time()-$realStart;
$startAll = time();
$at1 = 0;
printf "#REC\ttheWord\tMEAN\tMAX\tmeanTOP$top\tTIME\n";
foreach $w1 (@kernelKeys) {
    $startWord = time();
    @allSims = ();
    $at2 = -1;
    foreach $w2 (@kernelKeys) {
        $at2++;
        next if($at1 == $at2); # skip identical item, but not homophones
        push @allSims, inner($kernel{$w1}, $kernel{$w2});
        # $sim = inner($kernel{$w1}, $kernel{w2});
        # $totalsim += $sim;
        # if($sim > $maxsim){ $maxsim = $sim; }
        # # keep the top 20
        # if($#topX < $top){
        #     push @topX, $sim;
        # } else {
        #     @topX = sort { $a <=> $b } @topX;
        #     if($sim > $topX[0]){ $topX[0] = $sim; }
        # }
    }
    $at1++;
    $allSim = qsort(pdl(@allSims));
    $now = time();
    printf "$at1\t$w1\t%.6f\t%.6f\t%.6f\t%.5f\n",
        sum($allSim)/$#kernelKeys, max($allSim),
        sum($allSim->(($#kernelKeys - $top - 1 - 1):($#kernelKeys - 1)))/$top,
        $now - $startWord;
    unless($at1 % 25) {
        $elapsed = $now - $startAll;
        $thisWord = $now - $startWord;
        $perWord = $elapsed / ($at1 + 1);
        $hoursRemaining = ($perWord * ($#kernelKeys - $at1 + 1))/3600;
        printf STDERR "$at1\t$w1\t %.6f\tElapsed %.6f\tPerWord %.6f\tHoursToGo %.6f\n",
            $thisWord, $elapsed, $perWord, $hoursRemaining;
    }
}
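For reference, the vectorized approach mentioned above might look roughly like this: stack the normalised vectors into one 2-D piddle with cat() and let inner()'s threading produce all of one word's similarities in a single call. This is a sketch only, not the mailing-list code, and untested against this data:
use PDL;
use PDL::NiceSlice;

# assumes %kernel and @kernelKeys built exactly as in the script above
my $M = cat( @kernel{ @kernelKeys } );       # dims: (veclen, nwords)

for my $i ( 0 .. $#kernelKeys ) {
    # one threaded inner() call gives word $i's similarity to every word
    my $sims = inner( $M, $kernel{ $kernelKeys[$i] } );
    # blank out the self-similarity; harmless here because the 0/1 vectors
    # give non-negative similarities
    $sims->(($i)) .= 0;
    my $sorted = qsort $sims;
    # sum($sorted), max($sorted) and a slice of its top $top entries then
    # replace the per-pair bookkeeping in the inner loop
}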
"so the whole set can be processed in about 3 hours."
Glad you found a resolution, though 3 hours still isn't quick.
But I wonder if you are open to having your methodology questioned? (By someone who has little knowledge of your goals and probably wouldn't understand them if he did :)
Specifically I'd like to ask you about "word similarity"?