in reply to Seeking a fast sum_of_ranks_between function

Update: Greatly simplified the else case.

Try this:

#! perl -slw use strict; use Data::Dump qw[ pp ]; use List::Util qw[ sum ]; sub rankSums { my( $aRef, $bRef ) = @_; my( $aSum, $bSum ) = (0) x 2; my( $a, $b ) = (0) x 2; my $rank = 1; while( $a < @$aRef && $b < @$bRef ) { if( $aRef->[ $a ] < $bRef->[ $b ] ) { $aSum += $rank++; ++$a; } elsif( $aRef->[ $a ] > $bRef->[ $b ] ) { $bSum += $rank++; ++$b } else { $aSum += ( $rank * 2 + 1 ) / 2; $bSum += ( $rank * 2 + 1 ) / 2; $rank += 2; ++$a, ++$b; } } $aSum += $rank++ while $a++ < @{ $aRef }; $bSum += $rank++ while $b++ < @{ $bRef }; return $aSum, $bSum; } my @a = split ' ', <DATA>; my @b = split ' ', <DATA>; my( $aSum, $bSum ) = rankSums( \@a, \@b ); print "asum:$aSum bSum:$bSum"; #__DATA__ #1 3 5 7 9 #2 4 6 8 10 #__DATA__ #1 2 3 4 5 6 7 8 9 10 #3.14 4.25 5.36 6.47 7.58 __DATA__ 1 2 3 4 5 3 3.14 4 4

It makes a single pass over the data, and does no copying or sorting or memory allocation, so it should be considerably faster than the current method, but full testing & benchmarking it is your task :)

A further simplified version of the function that runs a tad quicker and has been tested with 1000 runs of 1e6 x 1e6 random integers:

sub rankSums3 { my( $aRef, $bRef ) = @_; my( $aSum, $bSum ) = (0) x 2; my( $a, $b ) = (0) x 2; my $rank = 1; while( $a < @$aRef && $b < @$bRef ) { $aSum += $rank++, ++$a, next if $aRef->[ $a ] < $bRef->[ $b ]; $bSum += $rank++, ++$b, next if $aRef->[ $a ] > $bRef->[ $b ]; $aSum += ( $rank * 2 + 1 ) / 2; $bSum += ( $rank * 2 + 1 ) / 2; $rank += 2; ++$a, ++$b; } $aSum += $rank++ while $a++ < @{ $aRef }; $bSum += $rank++ while $b++ < @{ $bRef }; return $aSum, $bSum; }

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^2: Seeking a fast sum_of_ranks_between function (1e6 x 1e6 in 1/5th sec.)
by msh210 (Monk) on Oct 14, 2015 at 14:20 UTC

    Many thanks, BrowserUk!

    Unless I'm misunderstanding your code, it seems to me that the case $aRef->[$a] == $bRef->[$b] is handled wrong. Consider the case (as in the example) $aRef=[1..5]; $bRef=[3, 3.14, 4, 4];. When the loop restarts at $rank==6, $a==3, $b==2, $aRef[$a]==4, $bRef[$b]==4, it adds 6.5 to $aSum and $bSum and increments $a and $b and double-increments $rank. Then the next run-through is at $rank==8, $a==4, $b==3, $aRef[$a]==5, $bRef[$b]==4, and it then adds 8 to $bSum. So, all in all, it adds 6.5 to $aSum and 6.5 and 8 to $bSum; but it should be adding 7 to $aSum and twice 7 to $bSum.

    (And that makes a big difference to me, because my data is likely to have a lot of "ties".) But thank you, again.

      Damn! The "simplifications" I applied subtly changed the results. Please try again with the original version I posted and see if that matches your existing method?

      sub rankSums { my( $aRef, $bRef ) = @_; my( $aSum, $bSum ) = (0) x 2; my( $a, $b ) = (0) x 2; my $rank = 1; while( $a < @$aRef && $b < @$bRef ) { if( $aRef->[ $a ] < $bRef->[ $b ] ) { $aSum += $rank++; ++$a; } elsif( $aRef->[ $a ] > $bRef->[ $b ] ) { $bSum += $rank++; ++$b } else { my $d = 2; my( $aSaved, $bSaved ) = ( $a, $b ); ++$d, ++$a while $a < $#{ $aRef } && $aRef->[ $a ] == $aRe +f->[ $a + 1 ]; ++$d, ++$b while $b < $#{ $bRef } && $bRef->[ $b ] == $bRe +f->[ $b + 1 ]; my $s = sum( $rank .. $rank + $d - 1 ) / $d; $aSum += $s * ( $a - $aSaved + 1 ); $bSum += $s * ( $b - $bSaved + 1 ); $rank += $d; ++$a, ++$b; } } $aSum += $rank++ while $a++ < @{ $aRef }; $bSum += $rank++ while $b++ < @{ $bRef }; return $aSum, $bSum; }

      The bonus is (assuming it's correct this time), is that with datasets containing many ties, it actually runs faster too.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Many thanks, again, BrowserUk!

        That looks right; also, I checked its results, not for large amounts of data, but for a few edge cases, and it works correctly for them. And it is much faster than the module I started with. I really appreciate your having done this.

        Keywords for later searchability are ranksum, rank sum.