Re: Closures and Statistics

Replies are listed 'Best First'.

Re(2): Closures and Statistics
by meta4 (Monk) on Jun 27, 2002 at 23:36 UTC

For starters, I think you switched a and b in the equations you give. The correct equations are given here and here. These equations are taken from the article on Least Squares Fitting at mathworld.wolfram.com. Notice that these equations are implicit definitions of a and b. In other words, each equation contains both variables, so you can't simply plug values into them and get a and b.

Simplifying these equations is not difficult, but it is not trivial either. The details of the simplification are covered in the article. The resulting equations are the ones you need to solve for the coefficients of the linear least squares fit, a and b. The formulas you use try to solve for a using b, and for b using a. You can't do that in Perl, not even using eval. This, however, is all math, and a bit off topic for Perlmonks. So let's talk Perl.
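For reference, here is a sketch of the two forms, assuming the fit is written as y = a + b x (a is the intercept, b is the slope) and the sums run over all n data points. The first pair is the implicit form, where each equation contains both unknowns. The second pair is what you get after solving that two-by-two system, and it is what the code in this post computes.

```latex
% Implicit form: each equation contains both a and b.
\begin{align}
  n\,a + b \sum x_i &= \sum y_i \\
  a \sum x_i + b \sum x_i^2 &= \sum x_i y_i
\end{align}

% Solved form: a and b expressed purely in terms of the sums.
\begin{align}
  a &= \frac{\sum y_i \sum x_i^2 - \sum x_i \sum x_i y_i}
            {n \sum x_i^2 - \left(\sum x_i\right)^2} \\
  b &= \frac{n \sum x_i y_i - \sum x_i \sum y_i}
            {n \sum x_i^2 - \left(\sum x_i\right)^2}
\end{align}
```

The shared denominator is zero when every x value is the same, in which case no unique line exists.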

For this task a closure is a bit of overkill, as stated above. Even arrays and hash tables are excessive unless you are trying to calculate fits on several sets of data at once. In the simple case plain old variables work just fine.
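If you do need fits for several data sets at once, a hash of running sums is probably enough, still without a closure. The following is only a sketch: the sub name fit_by_label and the "label x y" input format are assumptions made up for this illustration, not something from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: reads "label x y" lines from a filehandle and
# returns a hash mapping each label to [ a, b ] for the fit y = a + b*x.
sub fit_by_label {
    my ($fh) = @_;
    my %sums;    # label => running sums for that data set

    while ( my $line = <$fh> ) {
        my ( $label, $x, $y ) = split ' ', $line;
        next unless defined $y;    # skip blank or short lines

        my $s = $sums{$label} ||= { n => 0, sx => 0, sy => 0, sx2 => 0, sxy => 0 };
        $s->{n}++;
        $s->{sx}  += $x;
        $s->{sy}  += $y;
        $s->{sx2} += $x * $x;
        $s->{sxy} += $x * $y;
    }

    my %fit;
    for my $label ( keys %sums ) {
        my $s     = $sums{$label};
        my $denom = $s->{n} * $s->{sx2} - $s->{sx} * $s->{sx};
        next unless $denom;        # no spread in x, so no unique line

        $fit{$label} = [
            ( $s->{sy} * $s->{sx2} - $s->{sx} * $s->{sxy} ) / $denom,    # a
            ( $s->{n} * $s->{sxy} - $s->{sx} * $s->{sy} ) / $denom,      # b
        ];
    }
    return %fit;
}
```

Calling fit_by_label(\*STDIN) would fit every labelled set seen on standard input. Sets whose x values are all equal are silently left out of the result.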

The following example uses the first two columns of the input as x and y and calculates a and b of the least squares fit.

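#! /usr/bin/perl -w
use strict;

# Running sums; the data points themselves are never stored.
my $sum_x  = 0;
my $sum_y  = 0;
my $sum_x2 = 0;
my $sum_xy = 0;
my $n      = 0;

while( <> ) {
    # The first two whitespace-separated columns are x and y.
    my ( $x, $y ) = split;
    next unless defined $y;    # skip blank or one-column lines

    $sum_x  += $x;
    $sum_y  += $y;
    $sum_x2 += $x * $x;
    $sum_xy += $x * $y;
    $n++;
}

# Both coefficients share this denominator. It is zero when there is
# no input or when every x value is the same.
my $denom = $n * $sum_x2 - $sum_x * $sum_x;
die "Need at least two distinct x values to fit a line\n" unless $denom;

# Fit y = a + b*x: a is the intercept, b is the slope.
my $a = ( $sum_y * $sum_x2 - $sum_x * $sum_xy ) / $denom;
my $b = ( $n * $sum_xy - $sum_x * $sum_y ) / $denom;

print "a = $a; b = $b\n";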

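This code prints out

	a = 0.999999714285714; b = 2.00000057142857

when given the following input

	-2 -3
	-1 -1
	0 0.99999
	1 3.00001
	2 5
	3 7

which is verifiably correct: the points lie almost exactly on y = 1 + 2x, so a should come out close to 1 and b close to 2.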

As merlyn pointed out above, for a small data set like the example this would be reinventing the wheel. However, this implementation does not require storing the entire data set in memory, so it might be useful if the data set is large.

There are other problems if the data set gets too large, though. The sum of the squared x values or the sum of the x*y products may grow very large, so you may get an overflow or some loss of precision in the calculation. So don't blindly trust the outputs if the data set is large. Judging how valid the fit is falls outside the scope of Perlmonks (and is a bit over my head), but there are some resources listed in the article above and around the internet.
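One way to reduce the precision problem while still reading the data in a single pass is to keep running means and centered sums instead of raw sums. This is a different technique from the code above (a Welford-style update), shown only as a sketch. The sub name fit_stream and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: fits y = a + b*x from a filehandle of "x y" lines
# using running means and centered sums, and returns ( a, b ).
sub fit_stream {
    my ($fh) = @_;
    my ( $n, $mean_x, $mean_y, $sxx, $sxy ) = ( 0, 0, 0, 0, 0 );

    while ( my $line = <$fh> ) {
        my ( $x, $y ) = split ' ', $line;
        next unless defined $y;            # skip blank or short lines

        $n++;
        my $dx = $x - $mean_x;             # deviation from the old mean of x
        $mean_x += $dx / $n;
        $mean_y += ( $y - $mean_y ) / $n;
        $sxx += $dx * ( $x - $mean_x );    # old deviation times new deviation
        $sxy += $dx * ( $y - $mean_y );    # uses the updated mean of y
    }

    die "Need at least two distinct x values to fit a line\n" unless $sxx;

    my $slope     = $sxy / $sxx;                     # b
    my $intercept = $mean_y - $slope * $mean_x;      # a
    return ( $intercept, $slope );
}
```

Because the accumulated quantities are deviations from the mean rather than raw squares and products, they tend to stay much smaller when the x and y values are large. On the small example input the result should agree with the simple version to many digits.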

update: I cleaned up some spelling, and completely changed my position here.
