glwtta has asked for the wisdom of the Perl Monks concerning the following question:

I have an application that needs to compute a lot of correlations (Pearson and Spearman), and speed has become quite an issue. I've reduced the problem to the simplest possible form:
for ($i = 0; $i < @data; $i++) {
    for ($j = $i + 1; $j < @data; $j++) {
        $pears = sum($data[$i] * $data[$j]) / $samples;
    }
}
@data contains several tens of thousands of piddles with precomputed standard scores; $samples is the number of values in each piddle (side note: I would've thought that should be '$samples - 1', but this gives the same result as all the public implementations I've tried).

So my question is: if I were to reimplement this part in C, would I expect some sort of significant speedup? I'm looking for a very general ballpark, like 2-3x, or "don't bother, probably none at all".

Also, I was wondering whether stuffing the whole thing into a 2D piddle, rather than keeping it in a Perl array, might offer some benefit? I'm inclined to think not, but you never know. (Memory is not an issue at all here, just execution time.)
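
Something like this is roughly what I have in mind (an untested sketch, assuming every piddle has the same length $samples and that @data holds the z-scored vectors):

use PDL;

# Stack the per-vector piddles into one 2D piddle (one row per vector),
# then get all the pairwise sums of products in a single matrix multiplication.
my $m    = cat(@data);                        # dims ($samples, N): row k is vector k
my $corr = ($m x $m->transpose) / $samples;   # N x N matrix of correlation coefficients

# $corr->at($j, $i) is then the coefficient for vectors $i and $j
# (only the upper triangle is actually needed).

Although with tens of thousands of vectors the full N x N result gets enormous, so it would probably have to be computed in blocks of rows anyway.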

What I have right now is probably fast enough to make the project feasible - the largest dataset I have would take around 12 hours. But I have hundreds of these datasets, and even though I have the hardware to run quite a few in parallel, and most are much smaller than the largest, reducing the run-time even a little at this stage could save quite a bit of time by the time I'm done.

Any thoughts? Or other ideas I haven't thought of?

Re: PDL vs C speed question
by gam3 (Curate) on May 06, 2005 at 02:07 UTC
    Your example could be a bit clearer: you don't give the type of @data, and $pears doesn't appear to be used anywhere. You also don't say how the data array gets loaded or what its maximum size is. If the data will fit into a C array of doubles, I think you might see a speedup of as much as 50-fold.

    For 10,000 data points I was getting 6 seconds for the C code and 340 seconds for the Perl.

    There may be a faster way to do this in Perl though.
    -- gam3
    A picture is worth a thousand words, but takes 200K.
      Sorry, I should clarify: @data is an array of PDL objects. The piddles themselves are simple one-dimensional vectors of floats. The maximum size of @data is about 45,000; the piddles usually have around 100 values.

      $pears is not used because that's the end result I'm going after, it will just get stored to a file or database, there isn't much there, optimization-wise. Similarly, it doesn't matter how @data gets loaded, the time that it takes is negligible compared to the main computation.

      So @data is preloaded with PDL vectors of floats, which are the z-scores of the original values. All I need is the fastest way, for each pair of elements of @data, to divide the sum of the products of their standard scores by the size of the piddles to get the correlation coefficient (they are all the same size, and I know the size from the start).
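
      To put it as a formula: for each pair I'm computing r = sum(zx[k] * zy[k]) / n. If the z-scores were computed with the population standard deviation (divide by n), that is exactly the Pearson coefficient; with the sample standard deviation the matching divisor would be n - 1, which I suspect explains the side note in my original post.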

      Say N is the number of elements in @data, and M is the size of those elements; as far as I can tell I need to perform a minimum of N^2/2 * M multiplications, N^2/2 * (M-1) additions and N^2/2 divisions - I just don't know how much slower than C Perl (and PDL specifically) is at such things.
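
      For scale, with N around 45,000 and M around 100 that comes to roughly 45,000^2 / 2 * 100, i.e. on the order of 10^11 multiplications (and nearly as many additions).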

      By the way, I am not quite sure what your Perl example is doing (especially the sum() sub); I don't think that's what I have in mind.

      It should be pretty obvious that I know very little C, though I am sure I could scrape together something this trivial if it's worth it (and a 50x speedup would definitely be worth it, if that is actually the case).

      Thanks for the help

Re: PDL vs C speed question
by DrHyde (Prior) on May 06, 2005 at 10:34 UTC
    Once you've completed your code, would it be possible for you to release it onto the CPAN, or submit it as a patch to (eg) Statistics::RankCorrelation?
      I can't really see anyone else being able to use this. The original implementation is only faster because I know that I'll be comparing the same vectors many times, so I can precompute the expensive things (like the z-scores and the standard deviations). It's only useful in a small number of cases and is a trivial thing to implement.

      As for the C stuff, I don't know anything about XS (and it doesn't look easy), so I couldn't really provide a Perl interface to it. What I am doing is preprocessing the data into an easily digestible format in Perl, then just packing it and piping it to the program I posted (roughly as sketched below); the program will also need to collect the results, do a bunch of sorting - I only need the top couple hundred values for each vector - and probably dump them to a file that a Perl script will then load into a database.

      So, I minimize the time I spend in C and still remove the huge (quite an understatement there) bottleneck. I'm afraid I don't have the time or the knowhow to package this for external consumption.
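
      Roughly, the pack-and-pipe step would look something like this (a simplified sketch; './pearson' and the exact argument/record layout are just stand-ins):

      use strict;
      use warnings;

      # Stand-ins for the real z-score rows (arrayrefs, all the same length).
      my @data = (
          [ 0.52, -1.20, 0.68 ],
          [ 1.10, -0.30, -0.80 ],
      );

      # Start the C program, telling it how many vectors and how many values per vector.
      open my $c, '|-', './pearson', scalar(@data), scalar(@{ $data[0] })
          or die "can't start pearson: $!";
      binmode $c;

      # One packed row of doubles per vector; the C side reads them back in the same order.
      print {$c} pack('d*', @$_) for @data;

      close $c or die "pearson failed: $!";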

Re: PDL vs C speed question
by glwtta (Hermit) on May 06, 2005 at 23:03 UTC
    Holy crap!

    So I decided to try it after all, I tested the C implementation (see below) on the same dataset that took PDL 80m44s, and it finished in 1m42s! Simply compiling with -O3 reduced that further to 26 seconds! Or about 186 times faster.

    That pretty much answers my initial question. :) The only reason I was reluctant to try this from the beginning is that I know almost nothing about C, so even these simple things take me a while.

    I'm including the C program in case anyone is interested, but it's really embarrassingly trivial.

Re: PDL vs C speed question
by Roy Johnson (Monsignor) on May 06, 2005 at 23:34 UTC
    You can reduce the number of math ops and save some time. It won't make the Perl compete with the C, but you can apply the same optimization to the C, and make it that much faster. Basically, I'm factoring the multiplication out of the inner loop. The division can be done after all the looping.
    for my $i (0..$#data) {
        my $sum_j = 0;
        $sum_j += $data[$_] for $i..$#data;
        $pears += $data[$i] * $sum_j;
    }
    This was about 3x faster than (my interpretation of) your original, for me. Some of that comes from changing the looping constructs, but a chunk of it comes from fewer math ops.

    Caution: Contents may have been coded under pressure.
      I don't think I understand your code - is @data the same as my @data?

      I think what's confusing people here is the PDL notation - my fault, I should've given the pure Perl version. Since $data[$i] and $data[$j] are PDL vectors, the '*' operator is overloaded to return a vector of the same length as its arguments, with each component being the product of the corresponding components of the inputs. The sum() function simply returns the sum of a vector's components.
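
      For instance, just to illustrate the overloading:

      use PDL;

      my $x = pdl(1, 2, 3);
      my $y = pdl(4, 5, 6);

      my $prod = $x * $y;      # element-wise product: [4 10 18]
      my $s    = sum($prod);   # 4 + 10 + 18 = 32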

      The Perl version of my C example would look like this, assuming @data is an array of arrayrefs, rather than an array of PDL vectors:

      for ($i = 0; $i < @data; $i++) {
          for ($j = $i + 1; $j < @data; $j++) {
              my $sum = 0;
              $sum += $data[$i][$_] * $data[$j][$_] for 0 .. $#{ $data[$i] };
              my $pears = $sum / @{ $data[$i] };
          }
      }
      I really don't think the math here can be simplified any more; or else I need to brush up on basic arithmetic :)

      minor edit: forgot that square brackets get interpreted

Re: PDL vs C speed question
by Anonymous Monk on May 09, 2005 at 01:46 UTC
    Nearly off-topic, but have you thought of using the statistics package R for some of these calculations? There is a Perl module, Statistics::R, that implements a bridge between Perl and R. Correlations and the like in R are generally C-based, with some vectorized optimizations for the matrix operations. You might try some of your calculations there to see what speed gains you get (probably significant).
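
    Something along these lines (untested; this uses the set/run/get interface of current Statistics::R versions - the exact method names depend on which version you have) should give a quick feel for it:

    use Statistics::R;

    my $R = Statistics::R->new();

    # Two made-up vectors standing in for a pair of data rows.
    $R->set('x', [ 0.5, -1.2, 0.7, 1.3 ]);
    $R->set('y', [ 0.9, -0.7, 0.2, 1.1 ]);

    $R->run(q{r <- cor(x, y, method = "pearson")});
    my $r = $R->get('r');

    print "pearson r = $r\n";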

    Sean
      Yeah, I do use R a lot, though the last time I tested it, raw computation speed wasn't its biggest strength. A huge win for R would be a threaded implementation (I think there's work being done on this?), seeing how it already has a very parallel computation friendly syntax.

      R is pretty cumbersome to use from Perl though, I think I like the C approach better in this case.