
Thanks for posting the request for clarification.

The function takes two arrayrefs of numbers, lists all the elements from both arrays in ascending order, ranks them, and returns the sum of ranks for each array. For example, suppose we have my @one=(1..10); my @two=(3.14,4.25,5.36,6.47,7.58);. Then the elements, in order, are 1, 2, 3, 3.14, 4, 4.25, 5, 5.36, 6, 6.47, 7, 7.58, 8, 9, 10, so the ranks are:

1, 2, 3, 3.14, 4, 4.25, 5, 5.36, 6,  6.47, 7,  7.58, 8,  9,  10
1, 2, 3, 4,    5, 6,    7, 8,    9,  10,   11, 12,   13, 14, 15  # ranks

Then @one would have the ranks 1, 2, 3, 5, 7, 9, 11, 13, 14, 15, for a sum of 80, and @two would have the ranks 4, 6, 8, 10, 12, for a sum of 40, so the function could return {one=>80, two=>40}.

(That's not exactly how the module's function works: it takes a hashref as an argument, and… whatever. But the point is that it needs to "know about" two arrays and return a number for each.)
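
To make that concrete, here is a minimal sketch of the tie-free case in plain Perl. The function name, argument style, and return shape are made up for illustration and are not the module's actual interface; tie handling is added in the refinement further down.

use strict;
use warnings;

# Hypothetical sketch, not the module's interface: tag each value with
# the array it came from, sort the combined list numerically, and hand
# out ranks 1..N by position. Ties are ignored here.
sub sum_of_ranks_sketch {
    my ($one, $two) = @_;
    my @tagged = sort { $a->[0] <=> $b->[0] }
                 (map { [ $_, 'one' ] } @$one),
                 (map { [ $_, 'two' ] } @$two);
    my %sum = (one => 0, two => 0);
    for my $i (0 .. $#tagged) {
        $sum{ $tagged[$i][1] } += $i + 1;   # ranks are 1-based
    }
    return \%sum;
}

my $sums = sum_of_ranks_sketch([1 .. 10], [3.14, 4.25, 5.36, 6.47, 7.58]);
print "one: $sums->{one}, two: $sums->{two}\n";   # one: 80, two: 40
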

It gets more complicated when there are multiple copies of the same number. If @one=(1..5); @two=(3,3.14,4,4); then the ranking is

1, 2, 3,   3,   3.14, 4, 4, 4, 5
1, 2, 3.5, 3.5, 5,    7, 7, 7, 9  # ranks

because each element is given the arithmetic mean (average) of the possible ranks it qualifies for (the mean of 6, 7, and 8 is 7; the mean of 3 and 4 is 3.5).
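
Here is how that tie handling might look, extending the earlier sketch (the name is again hypothetical, not the module's): walk each run of equal values and give every member of the run the mean of the positional ranks the run spans.

use strict;
use warnings;

# Sketch of the tied case (name still hypothetical): identical values
# form a run, and every member of the run gets the arithmetic mean of
# the positional ranks the run covers.
sub sum_of_ranks_tied {
    my ($one, $two) = @_;
    my @tagged = sort { $a->[0] <=> $b->[0] }
                 (map { [ $_, 'one' ] } @$one),
                 (map { [ $_, 'two' ] } @$two);
    my %sum = (one => 0, two => 0);
    my $i = 0;
    while ($i < @tagged) {
        my $j = $i;   # extend $j to the end of the run of equal values
        $j++ while $j < $#tagged && $tagged[$j + 1][0] == $tagged[$i][0];
        my $mean = ($i + 1 + $j + 1) / 2;   # mean of ranks $i+1 .. $j+1
        $sum{ $tagged[$_][1] } += $mean for $i .. $j;
        $i = $j + 1;
    }
    return \%sum;
}

my $sums = sum_of_ranks_tied([1 .. 5], [3, 3.14, 4, 4]);
print "one: $sums->{one}, two: $sums->{two}\n";   # one: 22.5, two: 22.5

Since the sort dominates, this is O(N log N) overall, so on paper even a couple of million points should take seconds rather than hours.
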

Re^3: Seeking a fast sum_of_ranks_between function
by Laurent_R (Canon) on Oct 13, 2015 at 23:15 UTC
    I gather that you have two subsets, each with about a million data points, and that you say it is very slow. How slow is "very slow"? Can you give a time frame?

    I am asking because it does not appear to me that it should be very slow (unless I missed a very time-consuming step in the algorithm description). It could be that the module is doing a number of things not really necessary in your specific case, so that rolling your own sub or set of subs might be faster, or that it could be optimized some other way. But is "very slow" a few seconds or many hours? In the latter case (assuming I have understood what needs to be done), I am convinced it could be made faster. If it is only a few seconds, I would not even try to beat the module in terms of performance. In between, well, it is a toss-up. Please give a better estimate of the time frame.

    Of course, any further action would require a representative dataset (perhaps significantly smaller, but sufficient for benchmarking).
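
    As a starting point, Perl's core Benchmark module can put a number on it. A rough harness along these lines (the sizes are made up, and sum_of_ranks_tied is the hypothetical sketch from earlier in the thread, standing in for whatever ranking sub is actually being timed):

    use strict;
    use warnings;
    use Benchmark qw(timethese);

    # Rough timing harness: generate synthetic data at roughly the scale
    # mentioned in the thread, then time the ranking sub over it.
    my @one = map { rand } 1 .. 1_000_000;
    my @two = map { rand } 1 .. 1_000_000;

    timethese(5, {
        rank_sums => sub { my $r = sum_of_ranks_tied(\@one, \@two) },
    });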

    @BrowserUK: thanks for asking; I thought about more or less the same things, but somehow did not dare to ask.

      Hours. I suspect that, as you suggest, "the module is doing a number of things not really necessary in [my] specific case". Thanks for looking into this.