msh210 has asked for the wisdom of the Perl Monks concerning the following question:

The sum_of_ranks_between function from Statistics::Data::Rank is very slow on large data sets.[1] Is there a faster way to do the same thing, i.e. find the sum of ranks of a subset where ranking is across a pooled set? I don't need the other functions from that module (except new, of course), and need the sum_of_ranks_between function only in its sum_of_ranks_between(lab=>...) form.


[1] I have two subsets, each with about a million data points.

Replies are listed 'Best First'.
Re: Seeking a fast sum_of_ranks_between function (1e6 x 1e6 in 1/5th sec.)
by BrowserUk (Patriarch) on Oct 14, 2015 at 00:22 UTC

    Update: Greatly simplified the else case.

    Try this:

    #! perl -slw
    use strict;
    use Data::Dump qw[ pp ];
    use List::Util qw[ sum ];

    sub rankSums {
        my( $aRef, $bRef ) = @_;
        my( $aSum, $bSum ) = (0) x 2;
        my( $a, $b ) = (0) x 2;
        my $rank = 1;
        while( $a < @$aRef && $b < @$bRef ) {
            if( $aRef->[ $a ] < $bRef->[ $b ] ) {
                $aSum += $rank++; ++$a;
            }
            elsif( $aRef->[ $a ] > $bRef->[ $b ] ) {
                $bSum += $rank++; ++$b;
            }
            else {
                $aSum += ( $rank * 2 + 1 ) / 2;
                $bSum += ( $rank * 2 + 1 ) / 2;
                $rank += 2;
                ++$a, ++$b;
            }
        }
        $aSum += $rank++ while $a++ < @{ $aRef };
        $bSum += $rank++ while $b++ < @{ $bRef };
        return $aSum, $bSum;
    }

    my @a = split ' ', <DATA>;
    my @b = split ' ', <DATA>;
    my( $aSum, $bSum ) = rankSums( \@a, \@b );
    print "aSum:$aSum bSum:$bSum";

    #__DATA__
    #1 3 5 7 9
    #2 4 6 8 10
    #__DATA__
    #1 2 3 4 5 6 7 8 9 10
    #3.14 4.25 5.36 6.47 7.58
    __DATA__
    1 2 3 4 5
    3 3.14 4 4

    It makes a single pass over the data, and does no copying, sorting, or memory allocation, so it should be considerably faster than the current method; but full testing & benchmarking is your task :)
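    One caveat worth spelling out: the merge walk only works if both arrays are already sorted ascending. Below is a minimal cross-check of that core idea against a naive pooled-sort ranking, on distinct values only (ties are a separate question, taken up further down the thread). The sub names rankSumsMerge and rankSumsNaive are mine, not from the module.

```perl
# rankSumsMerge is the distinct-values core of the merge walk above
# (my name for it); rankSumsNaive is a straightforward pooled-sort check.
# Both give the same answer only on sorted, tie-free data.
use strict;
use warnings;

sub rankSumsMerge {
    my( $aRef, $bRef ) = @_;
    my( $aSum, $bSum ) = ( 0, 0 );
    my( $i, $j ) = ( 0, 0 );
    my $rank = 1;
    while( $i < @$aRef && $j < @$bRef ) {
        if( $aRef->[ $i ] < $bRef->[ $j ] ) { $aSum += $rank++; ++$i }
        else                                { $bSum += $rank++; ++$j }
    }
    $aSum += $rank++ while $i++ < @$aRef;
    $bSum += $rank++ while $j++ < @$bRef;
    return $aSum, $bSum;
}

sub rankSumsNaive {
    my( $aRef, $bRef ) = @_;
    my @pool = sort { $a <=> $b } @$aRef, @$bRef;
    my %rankOf;
    $rankOf{ $pool[$_] } = $_ + 1 for 0 .. $#pool;   # distinct values assumed
    my( $aSum, $bSum ) = ( 0, 0 );
    $aSum += $rankOf{$_} for @$aRef;
    $bSum += $rankOf{$_} for @$bRef;
    return $aSum, $bSum;
}

# Inputs MUST be sorted ascending before the merge walk sees them.
my @a = sort { $a <=> $b } ( 9, 1, 5, 3, 7 );
my @b = sort { $a <=> $b } ( 2, 10, 6, 4, 8 );
my( $m1, $m2 ) = rankSumsMerge( \@a, \@b );
my( $n1, $n2 ) = rankSumsNaive( \@a, \@b );
print "merge: $m1/$m2  naive: $n1/$n2\n";   # merge: 25/30  naive: 25/30
```

    The naive check costs an O(N log N) sort plus a hash; the merge itself is O(N), which is where the speed comes from once the inputs are sorted.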

    A further simplified version of the function that runs a tad quicker and has been tested with 1000 runs of 1e6 x 1e6 random integers:

    sub rankSums3 {
        my( $aRef, $bRef ) = @_;
        my( $aSum, $bSum ) = (0) x 2;
        my( $a, $b ) = (0) x 2;
        my $rank = 1;
        while( $a < @$aRef && $b < @$bRef ) {
            $aSum += $rank++, ++$a, next if $aRef->[ $a ] < $bRef->[ $b ];
            $bSum += $rank++, ++$b, next if $aRef->[ $a ] > $bRef->[ $b ];
            $aSum += ( $rank * 2 + 1 ) / 2;
            $bSum += ( $rank * 2 + 1 ) / 2;
            $rank += 2;
            ++$a, ++$b;
        }
        $aSum += $rank++ while $a++ < @{ $aRef };
        $bSum += $rank++ while $b++ < @{ $bRef };
        return $aSum, $bSum;
    }

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Many thanks, BrowserUk!

      Unless I'm misunderstanding your code, it seems to me that the case $aRef->[$a] == $bRef->[$b] is handled wrong. Consider the case (as in the example) $aRef=[1..5]; $bRef=[3, 3.14, 4, 4];. When the loop restarts at $rank==6, $a==3, $b==2, $aRef[$a]==4, $bRef[$b]==4, it adds 6.5 to $aSum and $bSum and increments $a and $b and double-increments $rank. Then the next run-through is at $rank==8, $a==4, $b==3, $aRef[$a]==5, $bRef[$b]==4, and it then adds 8 to $bSum. So, all in all, it adds 6.5 to $aSum and 6.5 and 8 to $bSum; but it should be adding 7 to $aSum and twice 7 to $bSum.

      (And that makes a big difference to me, because my data is likely to have a lot of "ties".) But thank you, again.
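      For the record, a self-contained reproduction (the simplified sub from the parent node, reformatted but otherwise unchanged) confirms the discrepancy on exactly that data:

```perl
# Reproduction of the tie bug: the simplified else-branch always averages
# exactly two ranks, so runs of three or more ties come out wrong.
use strict;
use warnings;

sub rankSums {
    my( $aRef, $bRef ) = @_;
    my( $aSum, $bSum ) = ( 0 ) x 2;
    my( $a, $b ) = ( 0 ) x 2;
    my $rank = 1;
    while( $a < @$aRef && $b < @$bRef ) {
        if( $aRef->[ $a ] < $bRef->[ $b ] ) { $aSum += $rank++; ++$a; }
        elsif( $aRef->[ $a ] > $bRef->[ $b ] ) { $bSum += $rank++; ++$b }
        else {
            $aSum += ( $rank * 2 + 1 ) / 2;   # averages only a pair of ranks
            $bSum += ( $rank * 2 + 1 ) / 2;
            $rank += 2;
            ++$a, ++$b;
        }
    }
    $aSum += $rank++ while $a++ < @{ $aRef };
    $bSum += $rank++ while $b++ < @{ $bRef };
    return $aSum, $bSum;
}

my( $aSum, $bSum ) = rankSums( [ 1 .. 5 ], [ 3, 3.14, 4, 4 ] );
print "aSum:$aSum bSum:$bSum\n";   # aSum:22 bSum:23 -- should be 22.5 / 22.5
```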

        Damn! The "simplifications" I applied subtly changed the results. Please try again with the original version I posted and see if that matches your existing method?

        sub rankSums {
            my( $aRef, $bRef ) = @_;
            my( $aSum, $bSum ) = (0) x 2;
            my( $a, $b ) = (0) x 2;
            my $rank = 1;
            while( $a < @$aRef && $b < @$bRef ) {
                if( $aRef->[ $a ] < $bRef->[ $b ] ) {
                    $aSum += $rank++; ++$a;
                }
                elsif( $aRef->[ $a ] > $bRef->[ $b ] ) {
                    $bSum += $rank++; ++$b;
                }
                else {
                    my $d = 2;
                    my( $aSaved, $bSaved ) = ( $a, $b );
                    ++$d, ++$a while $a < $#{ $aRef } && $aRef->[ $a ] == $aRef->[ $a + 1 ];
                    ++$d, ++$b while $b < $#{ $bRef } && $bRef->[ $b ] == $bRef->[ $b + 1 ];
                    my $s = sum( $rank .. $rank + $d - 1 ) / $d;
                    $aSum += $s * ( $a - $aSaved + 1 );
                    $bSum += $s * ( $b - $bSaved + 1 );
                    $rank += $d;
                    ++$a, ++$b;
                }
            }
            $aSum += $rank++ while $a++ < @{ $aRef };
            $bSum += $rank++ while $b++ < @{ $bRef };
            return $aSum, $bSum;
        }

        The bonus (assuming it's correct this time) is that with datasets containing many ties, it actually runs faster too.
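        Assuming the inputs are sorted ascending (which the merge requires), a quick self-contained check of this corrected version on the tie-heavy example from earlier in the thread does give the mean-rank answer:

```perl
# The corrected rankSums from above, reformatted, plus a check on the
# tie-heavy example. Inputs must be sorted ascending.
use strict;
use warnings;
use List::Util qw[ sum ];

sub rankSums {
    my( $aRef, $bRef ) = @_;
    my( $aSum, $bSum ) = ( 0 ) x 2;
    my( $a, $b ) = ( 0 ) x 2;
    my $rank = 1;
    while( $a < @$aRef && $b < @$bRef ) {
        if( $aRef->[ $a ] < $bRef->[ $b ] ) { $aSum += $rank++; ++$a; }
        elsif( $aRef->[ $a ] > $bRef->[ $b ] ) { $bSum += $rank++; ++$b }
        else {
            my $d = 2;                            # size of the tied run
            my( $aSaved, $bSaved ) = ( $a, $b );
            ++$d, ++$a while $a < $#{ $aRef } && $aRef->[ $a ] == $aRef->[ $a + 1 ];
            ++$d, ++$b while $b < $#{ $bRef } && $bRef->[ $b ] == $bRef->[ $b + 1 ];
            my $s = sum( $rank .. $rank + $d - 1 ) / $d;   # mean of the tied ranks
            $aSum += $s * ( $a - $aSaved + 1 );
            $bSum += $s * ( $b - $bSaved + 1 );
            $rank += $d;
            ++$a, ++$b;
        }
    }
    $aSum += $rank++ while $a++ < @{ $aRef };
    $bSum += $rank++ while $b++ < @{ $bRef };
    return $aSum, $bSum;
}

my( $aSum, $bSum ) = rankSums( [ 1 .. 5 ], [ 3, 3.14, 4, 4 ] );
print "aSum:$aSum bSum:$bSum\n";   # aSum:22.5 bSum:22.5
```

        That matches the hand computation: the three tied 4s occupy ranks 6, 7, 8 and each receives their mean, 7.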


Re: Seeking a fast sum_of_ranks_between function
by BrowserUk (Patriarch) on Oct 13, 2015 at 19:49 UTC

    Asking us to dig through what is a fairly complex module with a fairly complex set of dependencies to work out what you are asking for is not very conducive to your getting help.

    Can you post a (small) sample set of data, the required results, and describe the process by which they are obtained?



      Thanks for posting the request for clarification.

      The function takes two arrayrefs of numbers, lists all the elements from both arrays in order, ranks them, and returns the sum of ranks for each array. For example, suppose we have my @one=(1..10); my @two=(3.14,4.25,5.36,6.47,7.58);. Then the elements, in order, are 1, 2, 3, 3.14, 4, 4.25, 5, 5.36, 6, 6.47, 7, 7.58, 8, 9, 10, so the ranks are:

      1, 2, 3, 3.14, 4, 4.25, 5, 5.36, 6,  6.47,  7,  7.58,  8,  9, 10
      1, 2, 3, 4   , 5, 6   , 7, 8   , 9, 10   , 11, 12   , 13, 14, 15 # ranks

      Then @one would have the ranks 1, 2, 3, 5, 7, 9, 11, 13, 14, 15, for a sum of 80, and @two would have the ranks 4, 6, 8, 10, 12, for a sum of 40, so the function could return {one=>80, two=>40}.

      (That's not exactly how the module's function works: it takes a hashref as an argument, and… whatever. But the point is that it needs to "know about" two arrays and return a number for each.)
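      For what it's worth, the hand computation above can be checked with a few lines of naive Perl (pool, sort once, sum positional ranks; fine as a reference, far too much allocation to be the fast 1e6-point solution):

```perl
# Naive reference: pool both arrays, sort once, add up positional ranks.
# Distinct values only; ties are handled separately below.
use strict;
use warnings;

my @one = ( 1 .. 10 );
my @two = ( 3.14, 4.25, 5.36, 6.47, 7.58 );

# Tag each value with its source array, then sort the pool numerically.
my @pooled = sort { $a->[0] <=> $b->[0] }
             map( [ $_, 'one' ], @one ),
             map( [ $_, 'two' ], @two );

my %sums = ( one => 0, two => 0 );
$sums{ $pooled[$_][1] } += $_ + 1 for 0 .. $#pooled;

print "one=$sums{one} two=$sums{two}\n";   # one=80 two=40
```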

      It gets more complicated when there are multiple copies of the same number. If @one=(1..5); @two=(3,3.14,4,4); then the ranking is

      1, 2, 3  , 3  , 3.14, 4, 4, 4, 5
      1, 2, 3.5, 3.5, 5   , 7, 7, 7, 9 # ranks

      because each element is given the arithmetic mean (average) of the possible ranks it qualifies for (the mean of 6, 7, and 8 is 7; the mean of 3 and 4 is 3.5).
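      The tie rule can be written down directly: walk the sorted pool, find each run of equal values, and give every member of the run the mean of the rank positions the run occupies. A naive reference sketch for the example above:

```perl
# Naive tie-aware reference: each run of equal pooled values gets the
# mean of the rank positions it occupies.
use strict;
use warnings;

my @one = ( 1 .. 5 );
my @two = ( 3, 3.14, 4, 4 );

my @pooled = sort { $a->[0] <=> $b->[0] }
             map( [ $_, 'one' ], @one ),
             map( [ $_, 'two' ], @two );

my %sums = ( one => 0, two => 0 );
my $i = 0;
while( $i < @pooled ) {
    my $j = $i;
    ++$j while $j < $#pooled && $pooled[ $j + 1 ][0] == $pooled[ $i ][0];
    my $meanRank = ( ( $i + 1 ) + ( $j + 1 ) ) / 2;   # mean of ranks $i+1 .. $j+1
    $sums{ $pooled[$_][1] } += $meanRank for $i .. $j;
    $i = $j + 1;
}

print "one=$sums{one} two=$sums{two}\n";   # one=22.5 two=22.5
```

      Here @one collects ranks 1, 2, 3.5, 7, 9 and @two collects 3.5, 5, 7, 7; both sum to 22.5, and the two sums together still total 9*10/2 = 45, a handy invariant for testing.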

        I have gathered that you have two subsets, each with about a million data points. And that you say it is very slow. How slow is "very slow"? Can you give a time frame?

        I am asking because it does not appear to me that it should be very slow (unless I missed a very time-consuming step in the algorithm description). It could be that the module is doing a number of things not really necessary in your specific case, and that rolling your own sub or set of subs might be faster, or that it could be optimized some other way. But is "very slow" a few seconds or many hours? In the latter case (assuming I have understood what needs to be done), I am convinced it could be made faster. If it is only a few seconds, I would not even try to beat the module in terms of performance. In between, well, it could go either way. Please give a better estimate of the time frame.

        Of course, any further action would require a representative dataset (perhaps significantly smaller, but sufficient for benchmarking).

        @BrowserUk: thanks for asking; I thought about more or less the same things, but somehow did not dare to ask.