Win has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

When you do a correlation plot in Excel, you are given an equation for the fitted line in addition to the R2 value. I would like to do the same calculation within Perl, without producing graphical displays, and then present the correlations in a table. Does anyone have experience with this, and can anyone recommend modules for it? I will want to do about 10,000 correlations. Each correlation will have about 100 x-y coordinate points, which will be pulled from a database.

Re: Correlation plots
by jettero (Monsignor) on Oct 09, 2007 at 17:46 UTC
    It's probably worth learning PDL; that seems to be the cool thing if you have real data. I couldn't figure it out myself, for lack of something to do with it.

    On the other hand, it's not that hard to do these by hand. Particularly in perl.

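    use strict;
    use List::Util qw(sum);

    my @d    = ( 1 .. 10_000 );
    my $s    = sum @d;
    my $mean = $s/@d;
    # note: $var holds the sum of squared deviations; it is divided by the count below
    my $var  = (sum map { ($_-$mean)**2 } @d);
    my $std  = sqrt($var/@d);   # population standard deviation
    # etc...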

    ... The more I think about it though, if you're pulling these from a database, you don't really even need to do the stddev by hand. I imagine your database of choice has a stddev() built in. The covariance would probably have to be calculated by hand though. Aggregates generally can't be nested inside one another, so maybe "select avg(cola*colb) - avg(cola)*avg(colb) from tablename" ... or something like that.
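To make the "by hand" route concrete for two columns, here is a minimal sketch of the covariance and the correlation coefficient in plain Perl. The helper name pearson_r and the sample data are made up for illustration, and the whole data set is assumed to fit in memory:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: Pearson correlation of two equal-length array refs.
sub pearson_r {
    my ($xs, $ys) = @_;
    my $n = @$xs;
    die "need two equal-length lists of at least 2 points\n"
        unless $n >= 2 && $n == @$ys;

    my $mean_x = sum(@$xs) / $n;
    my $mean_y = sum(@$ys) / $n;

    # Population covariance and variances (divide by n, as in the stddev above).
    my $cov   = sum(map { ($xs->[$_] - $mean_x) * ($ys->[$_] - $mean_y) } 0 .. $n - 1) / $n;
    my $var_x = sum(map { ($_ - $mean_x) ** 2 } @$xs) / $n;
    my $var_y = sum(map { ($_ - $mean_y) ** 2 } @$ys) / $n;
    die "zero variance, correlation is undefined\n" if $var_x == 0 || $var_y == 0;

    return $cov / sqrt($var_x * $var_y);
}

# Made-up data lying exactly on a line, so r should come out as 1.
my @x = (1 .. 10);
my @y = map { 3 * $_ + 1 } @x;
my $r = pearson_r(\@x, \@y);
printf "r=%.4f, R2=%.4f\n", $r, $r ** 2;
```

For a straight-line fit, the R2 that Excel reports is the square of this r.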

    -Paul

      Your method requires all the data to be in memory at once, but it's simple to refactor it so that's not the case.

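      my ($cnt, $sum, $squ);
      while (my ($d) = $iter->()) {
          $cnt++;
          $sum += $d;        # running sum of the values
          $squ += $d * $d;   # running sum of the squared values
      }

      my $mean = $sum / $cnt;
      # sum of squared deviations, expanded so only the running sums are needed
      my $var  = $squ + -2*$mean*$sum + $mean*$mean*$cnt;
      my $std  = sqrt($var/$cnt);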

      while (my ($d) = $iter->()) can be replaced with any loop, including a file reading loop or a database fetching loop.

      Update: If you don't need $var anywhere else, the last two lines can be simplified to

      my $std = sqrt($squ/$cnt - $mean*$mean);
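The same running-sum idea extends to the x-y case the original question asks about. A minimal sketch, assuming an iterator that returns one (x, y) pair per call and an empty list when it is exhausted; linefit_stream and the sample data are names made up for illustration, not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: least-squares fit of y = intercept + slope*x,
# computed from running sums in a single pass over the data.
sub linefit_stream {
    my ($iter) = @_;
    my ($n, $sx, $sy, $sxx, $syy, $sxy) = (0, 0, 0, 0, 0, 0);

    while (my ($x, $y) = $iter->()) {
        $n++;
        $sx  += $x;
        $sy  += $y;
        $sxx += $x * $x;
        $syy += $y * $y;
        $sxy += $x * $y;
    }
    return if $n < 2;

    # Centered sums of squares and cross products.
    my $ssxx = $sxx - $sx * $sx / $n;
    my $ssyy = $syy - $sy * $sy / $n;
    my $ssxy = $sxy - $sx * $sy / $n;
    return if $ssxx == 0 || $ssyy == 0;   # all x or all y identical: no meaningful fit

    my $slope     = $ssxy / $ssxx;
    my $intercept = ($sy - $slope * $sx) / $n;
    my $r2        = ($ssxy * $ssxy) / ($ssxx * $ssyy);
    return ($intercept, $slope, $r2);
}

# Made-up data: 100 points lying exactly on y = 2x + 20.
my @points = map { [ $_, 2 * $_ + 20 ] } 1 .. 100;
my $iter   = sub { my $p = shift @points; return $p ? @$p : () };

my ($intercept, $slope, $r2) = linefit_stream($iter);
printf "a=%.4f, b=%.4f, R2=%.4f\n", $intercept, $slope, $r2;
```

One caveat with this approach: subtracting large, nearly equal sums can lose precision when the values are big relative to their spread, so the two-pass version may be the safer choice for such data.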

        Sure. But I started by imagining piddles, and I think you'd have to have them all in memory to use that well also. I've used PDL for a grand total of 30 minutes though, so I could be wrong.

        -Paul

Re: Correlation plots
by almut (Canon) on Oct 09, 2007 at 20:57 UTC

    If you prefer a simple specialised module over full-blown solutions like PDL or Statistics::R, you could use Statistics::LineFit (which is 27k pure Perl, without any external dependencies).

    Here's a simple example, which computes intercept, slope and R2 for one data set (100 x/y data points):

    use Statistics::LineFit;

    my $x = [ 1..100 ];
    my $y = [ map $_*2, 11..110 ];

    my $lfit = Statistics::LineFit->new();
    $lfit->setData($x, $y);

    printf "a=%.4f, b=%.4f, R2=%.4f\n", $lfit->coefficients(), $lfit->rSquared();

    # which outputs: a=20.0000, b=2.0000, R2=1.0000

    ( Looking at the input data, it's not too surprising that the R2 is exactly 1.0 )

    The module also lets you compute a number of related values, like residuals, standard error, etc., in case you should need them.

Re: Correlation plots
by mwah (Hermit) on Oct 09, 2007 at 19:38 UTC
    Win

    To be sure I understand what you actually intend, I'd rather ask first.

    From your post, I'd assume:

    - You need a linear regression of an arbitrary dataset
      which is in the form of [float(x),float(y)] x 100

    - You have 10,000 of these datasets, which you'll pull subsequently
      from a database

    - For each of the 10,000 data sets (100 coordinates each), you'll compute
      a fit (a line slope + an intercept) and its corresponding R2?

    - You'll store each of the (10,000) results, together with Record-ID, slope,
      intercept and R2 subsequently into a file (10,000 rows), which is
      your new table?

    Just asking to be sure ... ;-)

    Regards

    mwa
      You are correct in all your assumptions. Further, you may be interested to know that I do not want to pull out unique data coordinates for each plot. I am planning to take random samples from a base table and perform the correlation analysis on each of these random sample sets.
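Putting the confirmed workflow together, here is a rough sketch of the sampling loop, not a definitive implementation. The @base array is a made-up stand-in for the rows of the base table (in practice they would be fetched from the database), fit_line is a hypothetical helper, and the samples are drawn without replacement, which is an assumption about what is wanted:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Made-up stand-in for the base table: [x, y] rows with some random noise.
my @base = map { [ $_, 0.5 * $_ + 3 + rand(10) ] } 1 .. 1000;

# Hypothetical helper: least-squares fit of y = intercept + slope*x
# for a list of [x, y] rows. Returns an empty list if no fit is possible.
sub fit_line {
    my @rows = @_;
    my $n = @rows;
    return if $n < 2;

    my ($sx, $sy, $sxx, $syy, $sxy) = (0, 0, 0, 0, 0);
    for my $row (@rows) {
        my ($x, $y) = @$row;
        $sx  += $x;
        $sy  += $y;
        $sxx += $x * $x;
        $syy += $y * $y;
        $sxy += $x * $y;
    }

    my $ssxx = $sxx - $sx * $sx / $n;
    my $ssyy = $syy - $sy * $sy / $n;
    my $ssxy = $sxy - $sx * $sy / $n;
    return if $ssxx == 0 || $ssyy == 0;

    my $slope = $ssxy / $ssxx;
    return (($sy - $slope * $sx) / $n, $slope, ($ssxy * $ssxy) / ($ssxx * $ssyy));
}

my $samples     = 5;     # the question mentions about 10,000
my $sample_size = 100;   # points per correlation

# One tab-separated row per sample: this is the results table.
print join("\t", qw(sample intercept slope R2)), "\n";
for my $id (1 .. $samples) {
    # Shuffle the base rows and keep the first $sample_size of them.
    my @sample = (shuffle @base)[0 .. $sample_size - 1];
    my ($intercept, $slope, $r2) = fit_line(@sample);
    next unless defined $r2;
    printf "%d\t%.4f\t%.4f\t%.4f\n", $id, $intercept, $slope, $r2;
}
```

Shuffling the whole base array for every sample is simple but does more work than necessary; with a large base table it may be worth drawing the random rows on the database side instead.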