Correlation plots

Win has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Correlation plots by jettero (Monsignor) on Oct 09, 2007 at 17:46 UTC
It's probably worth learning PDL. That seems to be the cool thing, if you have real data. I couldn't figure it out for lack of something to do with it. On the other hand, it's not that hard to do these by hand, particularly in Perl. `use strict; use List::Util qw(sum); my @d = ( 1 .. 10_000 ); my $s = sum @d; my $mean = $s/@d; my $var = (sum map { ($_-$mean)**2 } @d); my $std = sqrt($var/@d); # etc...` The more I think about it though, if you're pulling these from a database, you don't really even need to do the stddev by hand. I imagine your database of choice has a stddev() built in. The covariance would probably have to be calculated by hand though. Maybe "`select sum( (cola - avg(cola))*(colb - avg(colb))/count(cola) ) from tablename`" ... or something like that. -Paul
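For the covariance part, here is a minimal by-hand sketch in the same spirit. The helper names `covariance` and `correlation` and the sample data are made up for illustration; it uses the population form (divide by n), matching the stddev snippet above.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: population covariance of two equal-length lists
# (passed as array references).
sub covariance {
    my ($x, $y) = @_;
    my $n      = @$x;
    my $mean_x = sum(@$x) / $n;
    my $mean_y = sum(@$y) / $n;
    return sum(map { ($x->[$_] - $mean_x) * ($y->[$_] - $mean_y) } 0 .. $n - 1) / $n;
}

# Pearson r = cov(x, y) / (std(x) * std(y)); cov(x, x) is the variance of x.
sub correlation {
    my ($x, $y) = @_;
    return covariance($x, $y)
         / sqrt(covariance($x, $x) * covariance($y, $y));
}

# Made-up sample data, only to show the calls.
my @x = (1 .. 5);
my @y = (2, 4, 5, 4, 5);
printf "cov=%.4f r=%.4f\n", covariance(\@x, \@y), correlation(\@x, \@y);
```

Squaring the result of `correlation` gives the R² of a straight-line fit through the same points.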
Re^2: Correlation plots by ikegami (Patriarch) on Oct 09, 2007 at 18:37 UTC
Your method requires all the data to be in memory at once, but it's simple to refactor it so that's not the case. `my ($cnt, $sum, $squ); while (my ($d) = $iter->()) { $cnt++; $sum += $d; $squ += $d * $d; } my $mean = $sum / $cnt; my $var = $squ + -2*$mean*$sum + $mean*$mean*$cnt; my $std = sqrt($var/$cnt);` `while (my ($d) = $iter->())` can be replaced with any loop, including a file reading loop or a database fetching loop. Update: If you don't need `$var` anywhere else, the last two lines can be simplified to `my $std = sqrt($squ/$cnt - $mean*$mean);`
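The same running-sums idea extends to two variables, which is what a line fit needs. The following is only a sketch: `streaming_fit` is a hypothetical helper, and it assumes the iterator returns one `[x, y]` array reference per call and an empty list when the data runs out.

```perl
use strict;
use warnings;

# Hypothetical helper: fit y = a + b*x in a single pass over an iterator.
sub streaming_fit {
    my ($iter) = @_;
    my ($n, $sx, $sy, $sxx, $syy, $sxy) = (0) x 6;
    while (my ($p) = $iter->()) {
        my ($x, $y) = @$p;
        $n++;
        $sx  += $x;
        $sy  += $y;
        $sxx += $x * $x;
        $syy += $y * $y;
        $sxy += $x * $y;
    }
    # Sums of squared deviations and of cross products, from the raw sums.
    my $dxx = $sxx - $sx * $sx / $n;
    my $dyy = $syy - $sy * $sy / $n;
    my $dxy = $sxy - $sx * $sy / $n;

    my $slope     = $dxy / $dxx;
    my $intercept = ($sy - $slope * $sx) / $n;
    my $r2        = $dxy * $dxy / ($dxx * $dyy);
    return ($intercept, $slope, $r2);
}

# Made-up data on an exact line, y = 2x + 20, served through a closure.
my @data = map { [ $_, 2 * $_ + 20 ] } 1 .. 100;
my $iter = sub { @data ? shift @data : () };
printf "a=%.4f, b=%.4f, R2=%.4f\n", streaming_fit($iter);
```

One caveat with this form: subtracting two large, nearly equal sums can lose precision, so for data with a large mean and a small spread the two-pass version is the safer choice.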
Re^3: Correlation plots by jettero (Monsignor) on Oct 09, 2007 at 18:59 UTC
Sure. But I started by imagining piddles, and I think you'd have to have them all in memory to use that well also. I've used PDL for a grand total of 30 minutes though, so I could be wrong. -Paul
Re^4: Correlation plots by ikegami (Patriarch) on Oct 09, 2007 at 19:08 UTC
Re^5: Correlation plots by jettero (Monsignor) on Oct 11, 2007 at 11:20 UTC
Re: Correlation plots by almut (Canon) on Oct 09, 2007 at 20:57 UTC
If you prefer a simple specialised module over full-blown solutions like PDL or Statistics::R, you could use Statistics::LineFit (which is 27k pure Perl, without any external dependencies). Here's a simple example, which computes intercept, slope and R² for one data set (100 x/y data points): `use Statistics::LineFit; my $x = [ 1..100 ]; my $y = [ map $_*2, 11..110 ]; my $lfit = Statistics::LineFit->new(); $lfit->setData($x, $y); printf "a=%.4f, b=%.4f, R2=%.4f\n", $lfit->coefficients(), $lfit->rSquared(); # which outputs: a=20.0000, b=2.0000, R2=1.0000` ( Looking at the input data, it's not too surprising that the R² is exactly 1.0 ) The module can also compute a number of related values, like residuals, standard error, etc., in case you need them.
Re: Correlation plots by mwah (Hermit) on Oct 09, 2007 at 19:38 UTC
Win: to pin down what you actually intend, I'd rather ask first. From your post, I'd assume:
- You need a linear regression of an arbitrary dataset which is in the form of [float(x),float(y)] x 100
- You have 10,000 of these datasets, which you'll pull one after another from a database
- For each of the 10,000 100-coordinate datasets, you'll compute some fit (a line slope + an intercept) and its corresponding R²?
- You'll store each of the 10,000 results, together with Record-ID, slope, intercept and R², in a file (10,000 rows), which is your new table?
Just asking to be sure ... ;-) Regards mwa
Re^2: Correlation plots by Win (Novice) on Oct 10, 2007 at 12:19 UTC
You are correct with all your assumptions. Further, you may be interested to know that I do not want to pull out unique data coordinates for each plot. I am planning to take random samples from a base table and perform the correlation analysis for each of these random sample sets.
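A rough sketch of that plan, under loudly stated assumptions: the base table is faked here as an in-memory array of made-up `[x, y]` rows, `fit_line` is a hypothetical helper, each sample is drawn without replacement, and only 5 samples are taken so the output stays short (the thread talks about 10,000 sets of 100 points).

```perl
use strict;
use warnings;
use List::Util qw(shuffle sum);

# Hypothetical helper: least-squares fit of y = a + b*x over [x, y] pairs.
# Returns (intercept, slope, R squared).
sub fit_line {
    my @pairs  = @_;
    my $n      = @pairs;
    my $mean_x = sum(map { $_->[0] } @pairs) / $n;
    my $mean_y = sum(map { $_->[1] } @pairs) / $n;
    my ($dxx, $dyy, $dxy) = (0, 0, 0);
    for my $p (@pairs) {
        my ($dx, $dy) = ($p->[0] - $mean_x, $p->[1] - $mean_y);
        $dxx += $dx * $dx;
        $dyy += $dy * $dy;
        $dxy += $dx * $dy;
    }
    my $slope = $dxy / $dxx;
    return ($mean_y - $slope * $mean_x, $slope, $dxy**2 / ($dxx * $dyy));
}

# Stand-in for the base table: 1,000 made-up rows on a noisy line.
my @base = map { [ $_, 0.5 * $_ + 3 + rand(10) ] } 1 .. 1000;

my ($samples, $size) = (5, 100);
for my $id (1 .. $samples) {
    # Shuffle the row indices and keep the first $size of them.
    my @rows = @base[ (shuffle 0 .. $#base)[ 0 .. $size - 1 ] ];
    my ($intercept, $slope, $r2) = fit_line(@rows);
    # One tab-separated result row: Record-ID, slope, intercept, R squared.
    printf "%d\t%.4f\t%.4f\t%.4f\n", $id, $slope, $intercept, $r2;
}
```

With a real database, the shuffle would be replaced by whatever random-row selection the database offers, and the `printf` by a write to the results file.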