in reply to Re: Seeking a fast sum_of_ranks_between function (1e6 x 1e6 in 1/5th sec.)
in thread Seeking a fast sum_of_ranks_between function

Many thanks, BrowserUk!

Unless I'm misunderstanding your code, it seems to me that the case $aRef->[$a] == $bRef->[$b] is handled incorrectly. Consider the case (as in the example) $aRef=[1..5]; $bRef=[3, 3.14, 4, 4];. When the loop restarts at $rank==6, $a==3, $b==2, $aRef->[$a]==4, $bRef->[$b]==4, it adds 6.5 to $aSum and $bSum, increments $a and $b, and double-increments $rank. The next pass is then at $rank==8, $a==4, $b==3, $aRef->[$a]==5, $bRef->[$b]==4, and it adds 8 to $bSum. So, all in all, it adds 6.5 to $aSum and 6.5 and 8 to $bSum; but the three 4s occupy ranks 6, 7 and 8, so it should be adding 7 to $aSum and twice 7 to $bSum.

(And that makes a big difference to me, because my data is likely to have a lot of "ties".) But thank you, again.
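
A minimal brute-force sketch of the expected result for that example, assuming ties receive the midrank (the average of the positions the tied group occupies in the pooled, sorted data). The name rank_sums_reference is hypothetical and not from the thread; it is meant only as a slow cross-check, not as the fast function being sought.

```perl
use strict;
use warnings;

# Hypothetical reference: pool both samples, sort, and give every member of a
# tied group the average of the 1-based positions that the group occupies.
sub rank_sums_reference {
    my( $aRef, $bRef ) = @_;

    # Tag each value with the sample it came from (0 = first, 1 = second).
    my @pooled = sort { $a->[0] <=> $b->[0] }
        ( map( [ $_, 0 ], @$aRef ), map( [ $_, 1 ], @$bRef ) );

    my @sums = ( 0, 0 );
    my $i = 0;
    while( $i < @pooled ) {
        # Find the last index $j of the run of values equal to $pooled[$i].
        my $j = $i;
        ++$j while $j < $#pooled && $pooled[ $j + 1 ][0] == $pooled[ $i ][0];

        # The run occupies ranks $i+1 .. $j+1; each member gets their mean.
        my $midrank = ( ( $i + 1 ) + ( $j + 1 ) ) / 2;
        $sums[ $_->[1] ] += $midrank for @pooled[ $i .. $j ];

        $i = $j + 1;
    }
    return @sums;
}

my( $aSum, $bSum ) = rank_sums_reference( [ 1 .. 5 ], [ 3, 3.14, 4, 4 ] );
print "$aSum $bSum\n";    # prints "22.5 22.5"
```

Here the three 4s sit at positions 6, 7 and 8, so each gets rank 7, and the two sums add up to 45, the sum of the ranks 1 through 9.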

Re^3: Seeking a fast sum_of_ranks_between function (1e6 x 1e6 in 1/5th sec.)
by BrowserUk (Patriarch) on Oct 14, 2015 at 14:52 UTC

    Damn! The "simplifications" I applied subtly changed the results. Please try again with the original version I posted and see if that matches your existing method?

    use List::Util qw( sum );    # sum() is used in the tie branch below

    sub rankSums {
        my( $aRef, $bRef ) = @_;
        my( $aSum, $bSum ) = (0) x 2;
        my( $a, $b ) = (0) x 2;
        my $rank = 1;
        while( $a < @$aRef && $b < @$bRef ) {
            if( $aRef->[ $a ] < $bRef->[ $b ] ) {
                $aSum += $rank++;
                ++$a;
            }
            elsif( $aRef->[ $a ] > $bRef->[ $b ] ) {
                $bSum += $rank++;
                ++$b
            }
            else {
                # Tie across both arrays: count every member of the tied group ...
                my $d = 2;
                my( $aSaved, $bSaved ) = ( $a, $b );
                ++$d, ++$a while $a < $#{ $aRef } && $aRef->[ $a ] == $aRef->[ $a + 1 ];
                ++$d, ++$b while $b < $#{ $bRef } && $bRef->[ $b ] == $bRef->[ $b + 1 ];
                # ... and give each member the mean of the ranks the group spans.
                my $s = sum( $rank .. $rank + $d - 1 ) / $d;
                $aSum += $s * ( $a - $aSaved + 1 );
                $bSum += $s * ( $b - $bSaved + 1 );
                $rank += $d;
                ++$a, ++$b;
            }
        }
        # One array is exhausted; the rest of the other takes the remaining ranks.
        $aSum += $rank++ while $a++ < @{ $aRef };
        $bSum += $rank++ while $b++ < @{ $bRef };
        return $aSum, $bSum;
    }

    The bonus (assuming it's correct this time) is that with datasets containing many ties, it actually runs faster too.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Many thanks, again, BrowserUk!

      That looks right; also, I checked its results, not for large amounts of data, but for a few edge cases, and it works correctly for them. And it is much faster than the module I started with. I really appreciate your having done this.

      Keywords for later searchability are ranksum, rank sum.

        I think the Statistics::Data::Rank module author would be interested in these results. Posting a simple benchmark on RT might be useful.
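
A rough sketch of what such a simple benchmark might look like, using the core Benchmark module. The assumptions are loud ones: the baseline rank_sums_pooled is a hypothetical sort-and-pool routine standing in for "the slow way" (it is not the Statistics::Data::Rank API, which is not shown in this thread), and the data sizes and value range are arbitrary choices made to force plenty of ties.

```perl
use strict;
use warnings;
use Benchmark qw( cmpthese );
use List::Util qw( sum );

# rankSums as posted in the thread: a single merge pass over two sorted arrays.
sub rankSums {
    my( $aRef, $bRef ) = @_;
    my( $aSum, $bSum ) = (0) x 2;
    my( $a, $b ) = (0) x 2;
    my $rank = 1;
    while( $a < @$aRef && $b < @$bRef ) {
        if( $aRef->[ $a ] < $bRef->[ $b ] )    { $aSum += $rank++; ++$a }
        elsif( $aRef->[ $a ] > $bRef->[ $b ] ) { $bSum += $rank++; ++$b }
        else {
            my $d = 2;
            my( $aSaved, $bSaved ) = ( $a, $b );
            ++$d, ++$a while $a < $#{ $aRef } && $aRef->[ $a ] == $aRef->[ $a + 1 ];
            ++$d, ++$b while $b < $#{ $bRef } && $bRef->[ $b ] == $bRef->[ $b + 1 ];
            my $s = sum( $rank .. $rank + $d - 1 ) / $d;
            $aSum += $s * ( $a - $aSaved + 1 );
            $bSum += $s * ( $b - $bSaved + 1 );
            $rank += $d;
            ++$a, ++$b;
        }
    }
    $aSum += $rank++ while $a++ < @{ $aRef };
    $bSum += $rank++ while $b++ < @{ $bRef };
    return $aSum, $bSum;
}

# Hypothetical baseline (not from the thread): pool, sort, assign midranks.
sub rank_sums_pooled {
    my( $aRef, $bRef ) = @_;
    my @pooled = sort { $a->[0] <=> $b->[0] }
        ( map( [ $_, 0 ], @$aRef ), map( [ $_, 1 ], @$bRef ) );
    my @sums = ( 0, 0 );
    my $i = 0;
    while( $i < @pooled ) {
        my $j = $i;
        ++$j while $j < $#pooled && $pooled[ $j + 1 ][0] == $pooled[ $i ][0];
        my $midrank = ( $i + $j + 2 ) / 2;    # mean of ranks $i+1 .. $j+1
        $sums[ $_->[1] ] += $midrank for @pooled[ $i .. $j ];
        $i = $j + 1;
    }
    return @sums;
}

# Assumed test data: sorted integers drawn from a small range, so ties abound.
srand( 42 );
my @x = sort { $a <=> $b } map { int rand 1000 } 1 .. 10_000;
my @y = sort { $a <=> $b } map { int rand 1000 } 1 .. 10_000;

# Check that both routines agree before timing anything.
my @fast = rankSums( \@x, \@y );
my @slow = rank_sums_pooled( \@x, \@y );
die "results differ: @fast vs @slow"
    if abs( $fast[0] - $slow[0] ) > 1e-6 || abs( $fast[1] - $slow[1] ) > 1e-6;

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    merge  => sub { rankSums( \@x, \@y ) },
    pooled => sub { rank_sums_pooled( \@x, \@y ) },
} );
```

For a report to the module author, the pooled baseline would be swapped for a call into the module itself, with the agreement check kept in place so that the timings compare like with like.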