david_lyon has asked for the wisdom of the Perl Monks concerning the following question:

Bows to the wise perlmonks

Hi I have 2 files:
FILE1: <snip> 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 </snip> FILE2: <snip> 2 1 2 2 1 2 2 2 1 2 2 2 2 1 2 2 1 2 2 2 1 2 2 2 </snip>

How do I go down each line in each file and work out the $r-squared value for each line in each file. The code I have so far as below.....MANY THANKS IN ADVANCE

#!/usr/bin/perl open (DATA, "FILE1"); @geno = <DATA>; open (DATA2, "FILE2"); @dos = <DATA2>; foreach (1..4){ $num=$_; $dosage = $dos[$num]; $genotype = $geno[$num]; rsquared($dosage, $genotype); } sub rsquared( $dosage =shift; $genotype =shift; ###HELP HERE TO FIND R-squared )

Added in more info as to what R2 is as below, thanks for everyones help The coefficient of determination R2 is used in the context of statistical models whose main purpose is the prediction of future outcomes on the basis of other related information. It is the proportion of variability in a data set that is accounted for by the statistical model.1 It provides a measure of how well future outcomes are likely to be predicted by the model. <code> Thanks for everyones help on this. I have made use of the wondeful code below... One last question in passing: A question in passing do you know why this online calculator http://easycalculation.com/statistics/r-squared.php can find r^2 (but not r) on values: "2,2,2,2" "2,2,2,1" Is that a bug on their software? Thanks again </code

Replies are listed 'Best First'.
Re: finding R-squared..please help
by Eliya (Vicar) on Feb 19, 2012 at 03:16 UTC

    Don't reinvent the wheel.  Use one of the existing libraries/packages for this.  For example Math::GSL, which is a binding to the GSL library.

    The R-squared for a linear least squares fit of two vectors @$x, @$y (with x predicting y) can be computed as follows:

    #!/usr/bin/perl -w use strict; use Math::GSL::Fit "gsl_fit_linear"; use Math::GSL::Statistics "gsl_stats_tss"; my $x = [1,2,3,4,5]; my $y = [5,7,8,12,13]; my $n = @$y; my ($status, $c0, $c1, $cov00, $cov01, $cov11, $ss_resid) = gsl_fit_linear($x, 1, $y, 1, $n); my $ss_total = gsl_stats_tss($y, 1, $n); print "SS residual = $ss_resid\n"; print "SS total = $ss_total\n"; my $R2 = 1 - $ss_resid / $ss_total; print "R^2 = $R2\n";
    SS residual = 1.9 SS total = 46 R^2 = 0.958695652173913

    If I'm understanding you correctly, you'd want to do this computation for every pair of lines.

    (Note that for R^2 to be computable, there has to be some variance in the data to be predicted (as often with statistics).  In other words, $y = [2,2,2,2,2,2] wouldn't work, because here $ss_total would be 0, which would cause a division by zero error.)

      Very nice code thank you so much Eliya! A question in passing do you know why this online calculator http://ea +sycalculation.com/statistics/r-squared.php can find r^2 (but not r) o +n values: "2,2,2,2" "2,2,2,1" Is that a bug on their software? Thanks again for your help!

        Strictly speaking, neither r nor R^2 can be computed with those vectors (so yes, you could consider it a bug).   This is because no line can be fitted with finite values for $c0 and $c1.  And when there's no line, there are no residuals, etc.

        When you change the values slightly to

        2,2,2,2.0001 2,2,2,1

        a line can be fitted (so you also get correct values on that site), but the fitted line parameters are already rather large:

        my $c0 = 20001.9999999578; # as computed by gsl_fit_linear() my $c1 = -9999.9999999789; say $c0 + $c1 * $_ for 2, 2.0001; __END__ 2 1

        The smaller you make the deviation from 2 for the last value of vector x, the larger the fitted parameters become, eventually approaching delicately balancing +/- "infinities".  And when all values are exactly 2, no fit can be computed any longer...   (try 2.0000001 and 2.00000001 on the linked site, and you'll already get nonsensical values (correct values should always be r = -1, r^2 = 1) — which means numerical precision is rather low).

Re: finding R-squared..please help
by choroba (Cardinal) on Feb 18, 2012 at 21:07 UTC
Re: finding R-squared..please help
by LonelyPilgrim (Beadle) on Feb 18, 2012 at 21:09 UTC

    I am kind of a math and science ignoramus, but I don't have any idea what "R-squared" is. Is it a math function? The square of some number you're reading? Or something else? What do you need your sub to return?

    I think, immediately, you probably need to break the lines you read into fields. You seem to be passing whole lines of data to your sub, correct? Is that what you want to do? When the sub receives them, you probably need to have something like this:

    my @dos = split(/\s*/, $dosage); # Split $dosage into an array at +the spaces between the numbers my @geno = split(/\s*/, $genotype); # Do the same thing for $genotype

    After that, I'm not sure what you want to do.

      I am kind of a math and science ignoramus, but I don't have any idea what "R-squared" is.
      That's why we have Google and Wikipedia. Really, if you don't know what it is, and are too lazy too look it up, you can also decide not to answer, and leave an answer for someone who knows what it is.

      We don't want to spoon feed people who come here with questions, and expect people asking questions to do some research. But that works both way. "R-squared" is well known enough that it doesn't need explaining -- someone who can answer the question will know what it is; someone who doesn't know what it is, is likely not being able to answer the question.

        Forgive me. I was under the mistaken impression that this forum was about Perl programming. Perl I know, and enjoy offering help regarding, and despite my other shortcomings, I hoped my comment might be helpful. I was also under the impression, perhaps mistakenly, that the person asking the question knew what the function was, and didn't need to be told this or how to work it mathematically.