in reply to Re^2: Seeking a fast sum_of_ranks_between function (1e6 x 1e6 in 1/5th sec.)
in thread Seeking a fast sum_of_ranks_between function

Damn! The "simplifications" I applied subtly changed the results. Please try again with the original version I posted and see whether that matches your existing method.

use List::Util qw( sum );    # rankSums() calls sum() for tied ranks

# Expects both arrays sorted ascending; returns the rank sum of each.
sub rankSums {
    my( $aRef, $bRef ) = @_;
    my( $aSum, $bSum ) = (0) x 2;
    my( $a, $b ) = (0) x 2;
    my $rank = 1;
    while( $a < @$aRef && $b < @$bRef ) {
        if( $aRef->[ $a ] < $bRef->[ $b ] ) {
            $aSum += $rank++;
            ++$a;
        }
        elsif( $aRef->[ $a ] > $bRef->[ $b ] ) {
            $bSum += $rank++;
            ++$b;
        }
        else {
            # Tie across the two arrays: count every equal value ($d),
            # then give each one the mean of the ranks they span.
            my $d = 2;
            my( $aSaved, $bSaved ) = ( $a, $b );
            ++$d, ++$a while $a < $#{ $aRef } && $aRef->[ $a ] == $aRef->[ $a + 1 ];
            ++$d, ++$b while $b < $#{ $bRef } && $bRef->[ $b ] == $bRef->[ $b + 1 ];
            my $s = sum( $rank .. $rank + $d - 1 ) / $d;
            $aSum += $s * ( $a - $aSaved + 1 );
            $bSum += $s * ( $b - $bSaved + 1 );
            $rank += $d;
            ++$a, ++$b;
        }
    }
    # Whatever remains in either array takes the remaining ranks in order.
    $aSum += $rank++ while $a++ < @{ $aRef };
    $bSum += $rank++ while $b++ < @{ $bRef };
    return $aSum, $bSum;
}

The bonus (assuming it's correct this time) is that with datasets containing many ties, it actually runs faster too.
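
For anyone wanting to cross-check the merge-based routine above on small inputs, a brute-force reference might look something like the following. This is only a sketch and not part of the original code: the name naiveRankSums is made up for illustration, and it assumes purely numeric data. It pools both samples, sorts them, gives every run of equal values the mean of the ranks it spans, and sums per sample. It is far slower than a single merge pass, so it is only suitable for testing.

```perl
use strict;
use warnings;

# Hypothetical brute-force cross-check (illustrative name, not from the post):
# pool both samples, sort, give tied values their mean rank, sum per sample.
sub naiveRankSums {
    my( $aRef, $bRef ) = @_;

    # Tag each value with its sample index: 0 for @$aRef, 1 for @$bRef.
    my @pool = sort { $a->[0] <=> $b->[0] }
        ( map( [ $_, 0 ], @$aRef ), map( [ $_, 1 ], @$bRef ) );

    my @sums = ( 0, 0 );
    my $i = 0;
    while( $i < @pool ) {
        # Find the last index of the run of values equal to $pool[$i].
        my $j = $i;
        ++$j while $j < $#pool && $pool[ $j + 1 ][0] == $pool[ $i ][0];

        # Ranks are 1-based, so the run occupies ranks $i+1 .. $j+1;
        # every member of the run gets the mean of those ranks.
        my $meanRank = ( $i + 1 + $j + 1 ) / 2;
        $sums[ $_->[1] ] += $meanRank for @pool[ $i .. $j ];

        $i = $j + 1;
    }
    return @sums;
}

# The three 2s share ranks 2, 3 and 4, so each is ranked 3.
my( $aSum, $bSum ) = naiveRankSums( [ 1, 2, 2, 5 ], [ 2, 3 ] );
print "$aSum $bSum\n";    # prints "13 8"
```

Feeding the same pre-sorted arrays to both routines and comparing the two returned sums is a quick way to catch a regression in the tie-handling branch. A cheap extra check is that the two sums must always add up to N * ( N + 1 ) / 2, where N is the combined number of values.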


Re^4: Seeking a fast sum_of_ranks_between function (1e6 x 1e6 in 1/5th sec.)
by msh210 (Monk) on Oct 14, 2015 at 15:33 UTC

    Many thanks, again, BrowserUk!

    That looks right; also, I checked its results, not for large amounts of data, but for a few edge cases, and it works correctly for them. And it is much faster than the module I started with. I really appreciate your having done this.

    Keywords for later searchability are ranksum, rank sum.

      I think the Statistics::Data::Rank module author would be interested in these results. Posting a simple benchmark on RT might be useful.

        Great idea. I didn't benchmark, but included a link hither in a bug report so that the maintainer knows about this thread.