dimmesdale has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to write a quick script fragment to find the Least Squares Regression line of some data points. To find the line of best fit y = ax + b (through the above mentioned method) with given a set of data {(x1,y1),(x2,y2),...,(xn,yn)}, the following system is solved for:

(The math is hard to write, and therefore to read, but not fundamental to reading the rest of this)

EQ1: nb + SUM(xi,i=1 TO n)a = SUM(yi,i=1 TO n)
EQ2: SUM(xi,i=1 TO n)b + SUM(xi^2,i=1 TO n)a = SUM(xi*yi, i=1 TO n)

The (x,y) points I need (quite a few of them!) are read in line by line, so the problem comes up with storage; it requires far too much space to store these, so I decided to make a closure (with the hope of) summing as we go along, and then (when passed the right argument) calculating the data (actually, I was rather happy with the code I wrote to solve the sytem of equations; when I looked at it I didn't know what to do, and thought that it was the type of thing complex C code has been written for. . . but where there's perl there's a way, and eval popped in my head, but I'd be interested to hear other ideas).

I'm not familiar with closures (I haven't used them often), so I looked through this site and at perldoc's closure information (including perlref). Well, even after all that I'm still getting some problems. Now, I imagine its a simple (as in basic/fundamental) closure issue which I'm missing, but when I call the closure (&$closure($args)) and through a debugging statement print the arguments, I get . . . NOTHING.

If closures aren't the way to go, fine (but I'd be interested in knowing what to do). The code is below (just the closure; I just made up some simple data through a 1..10 and 20..30 statement for x y coordinates and tested it with that; the call mimicks what I have above).

I wrote this just now quickly in an attempt to see what I'll need to do with the final thing, so any suggestions are more than welcome.

# bf stands for best fit; eventually I'll need several that # will mimick this format closely. my $bf_1 = sub { # Arguments are: Add|Equate; [x value; y value]. my $sum_x; my $sum_y; my $sum_x2; my $sum_xy; my $n; $sum_x = 0 unless $sum_x; $sum_y = 0 unless $sum_y; $sum_x2 = 0 unless $sum_x2; $sum_xy = 0 unless $sum_xy; $n = 0 unless $n; if($ARGV[0] eq 'Add') { my @args = @_; $sum_x += $args[1]; $sum_y += $args[2]; $sum_x2 += $args[1]*$args[2]; $sum_xy += $args[1] ** 2; $n++; } elsif ($ARGV[0] eq 'Equate') { my $a; my $b; $a = "(($sum_y - $n*$b)/$sum_x)"; $b = "($sum_xy-$sum_x2*$a)/$sum_x"; eval $b; eval $a; ($a,$b); } else { print "Called with unkown argument\n" } };

Replies are listed 'Best First'.
Re: Closures and Statistics
by broquaint (Abbot) on Jun 27, 2002 at 00:31 UTC
    The example code you gave isn't a closure but an anonymous function. A closure is when a function contains some form of lexical state within it e.g
    { open(my $fh, "somefile.txt") or die("ack - $!"); sub readfile { return <$fh>; } } print while readfile();
    So readfile holds $fh which can't be seen anywhere else in the program as it has fallen out of scope due to the enclosing bare block. For an effective example of closures see tilly's Why I like functional programming. Also be cautious of the special variables $a and $b as they are used by sort() and are package variables (so you're safe in this case as they're lexical).
    HTH

    _________
    broquaint

Re: Closures and Statistics
by bronto (Priest) on Jun 27, 2002 at 07:58 UTC

    You didn't create a closure, but an anonymous subroutine. I'll try to explain in the most simple way what a closure is.

    A classical example of a simple closure is that of the "multiplier generator":

    sub multiplier_by { my $x = shift ; return sub { my $n = shift ; return $n * $x ; } }

    Then if you run this:

    my $m2 = multiplier_by(2) ; my $m3 = multiplier_by(3) ;

    you will have:

    my $i = 5 ; print $m2->($i) ; # prints 10 print $m3->($i) ; # prints 15 # other syntaxes for $m2, $m3 invokations allowed :-)

    Now follow the execution of my $m2 = multiplier_by(2) ;: you pass a value of 2, that goes into the lexical $x and gets used in the "anonymous subroutine" definition. When the subroutine returns, you assign the return value to $m2, and here the magic happens: since you are still using it, $x is still alive somewhere in your RAM, and is accessible only inside the code inside $m2. You closed it in a cage, poor $x :-). And you have a reference to a subroutine that multiplies its argument by 2. That's your closure: ta-dah!

    When you call again multiplier_by passing 3 as argument, things still work because the newly created $x has nothing in common with the previous one. So $m3 gets a reference to a subroutine that multiplies by 3 the parameter you pass to it.

    I hope I made things clearer and not confused you more :-)

    A note on your code: since you are declaring the variables in your subroutine with my, which is a really-good-thing-to-do, the subsequent chain of $var = 0 unless $var is superfluous as it will be always executed. You could assign a value of 0 directly in the declaration: my $var = 0 ;

    Phew! I think it's better for me to stop here or my boss will fire me!

    Ciao!
    --bronto

    # Another Perl edition of a song:
    # The End, by The Beatles
    END {
      $you->take($love) eq $you->made($love) ;
    }

•Re: Closures and Statistics
by merlyn (Sage) on Jun 27, 2002 at 00:17 UTC
      Well, yes. From what I saw it looks like you have to have the whold @data array to pass it. As I said, I'm doing mine on a per line basis. Also the computer this is going to be on has a version of WinZip which for some reason isn't extracting the tar files on CPAN correctly (at all), so I'm trying to keep the modules I'm using to a minimum (plus: is it that complicated of code that I need to use a module to simpilfy it?)

      I don't need detailed statistical output of the data I'm going through (the person I'm making this for has something for this already--I'm just getting, through this and other scripts I've written, more data points to try and spot outliers, and for a few other purposes he defined).

Re: Closures and Statistics
by sfink (Deacon) on Jun 27, 2002 at 07:05 UTC
    First, to change this to use a closure so that it does what you want, you need to say something like:
    my $bf_1; { # Start a new block to avoid cluttering the environment my $sum_x = 0; $bf_1 = sub { # This block references $sum_x, # so it will be captured in the closure my ($operation, @args) = @_; # You didn't mean @ARGV; it's for command-line args if($operation eq 'Add') { $sum_x += $args[0]; } }; }
    Second, your eval"" isn't going to work. $a gets set to a string containing the actual values of the various variables you reference -- including $b, which at that point is the empty string. $b will then substitute that whole string in for $a. Not really what you want. There's no magic way of simultaneously solving two equations with textual substitution and eval"".

    And finally, you might want to proofread your calculations for $sum_x2 and $sum_xy. :-)

Re: Closures and Statistics
by rjray (Chaplain) on Jun 27, 2002 at 07:18 UTC

    I have to confess that I don't understand why you need closures at all. It appears that you are intending to use them to keep the running tallies, but you could just as easily do that with an array or hash-table. Also, there's no reason to be using eval when calculating the final values.

    --rjray

      For starters, I think you switched a and b in the equations you give. The correct equations are: given here and here,. These equations are taken from an article on mathworld.wolfram.com on Least Squares Fitting. Notice these equations are implicit definitions of a and b. In other words both equations have both variables, so you can't simply plug in values to these equations and get a and b.

      Simplifying these equations is not difficult but not trivial. The details of simplification is covered in the article. The resulting equations are the equations you need to use to solve for the coefficients of the linear least squares fit, a and b. The formulas you use try to solve for a using b, and solve for b using a. You can't do that, in perl not even using eval. This, however; is all math, and a bit off topic for Perlmonks. So let's talk Perl.

      For this task a closure is a bit over kill as stated above. Even arrays and hash tables are excessive unless you are trying to calculate fits on several sets of data at once. In the simple case plain old variables work just fine.

      The following example uses the first two columns of the input as x and y and calculates a and b of the least squares fit.

      #! /usr/bin/perl -w use strict; my $sum_x = 0; my $sum_y = 0; my $sum_x2 = 0; my $sum_xy = 0; my $n = 0; while( <> ) { my ( $x, $y ) = split; $sum_x += $x; $sum_y += $y; $sum_x2 += $x * $x; $sum_xy += $x * $y; $n++; } my $a = ( $sum_y * $sum_x2 - $sum_x * $sum_xy ) / ( $n * $sum_x2 - $sum_x * $sum_x ); my $b = ( $n * $sum_xy - $sum_x * $sum_y ) / ( $n * $sum_x2 - $sum_x * $sum_x ); print "a = $a; b = $b\n";

      This code prints out

      	a = 0.999999714285714; b = 2.00000057142857
      
      when given the following input

      -2 -3
      -1 -1
       0  0.99999
       1  3.00001
       2  5
       3  7
      
      which is veritably correct.

      This would be reinventing the wheel as pointed out by merlyn above for a small data set like the example. However, this implementation does not require storing the entire data set in memory. So it might be useful if the data set is large. There are other problems if the data set gets to large. The sum of the x squared's or the sum of the x*y's may get too large if the data set is really big. So, you may get an overflow, or some loss of precision in the calculation with large data sets. So don't blindly trust the outputs if the data set is large. Determining how valid the fit is is off topic for Perlmonks (and a bit over my head), but there are some resources listed in the article above, and around the internet.

      update: I cleaned up some spelling, and completely changed my position here.

Re: Closures and Statistics
by meta4 (Monk) on Jun 27, 2002 at 23:56 UTC

    After writing this response I realized that you might be trying to calculate the least squares fit for multiple data sets at the same time. If you are, then the advice I gave isn't very useful. It turns out that closures might not be such a bad idea after all. Here is my attempt to solve the problem using closures.

    #! /usr/bin/perl -w use strict; # Create an array of closures my @LS_closures = ( create_LS_closure(), create_LS_closure(), ); while( <> ) { my ( $x1, $y1, $x2, $y2 ) = split; # add the current data to the closure $LS_closures[0]{add}->( $x1, $y1 ); $LS_closures[1]{add}->( $x2, $y2 ); } # calculate the least squares coefficient my ( $a, $b ) = $LS_closures[0]{calcCoefs}->(); print "for the first set of data a = $a; b = $b\n"; ( $a, $b ) = $LS_closures[1]{calcCoefs}->(); print "for the second set of data a = $a; b = $b\n"; # ----------------------------------------------- # The closure generating subroutine sub create_LS_closure { # these variable are unique to each closure # created my $sum_x = 0; my $sum_y = 0; my $sum_x2 = 0; my $sum_xy = 0; my $n = 0; return { # This is the anonymous function for adding # data a to the least squares fit add => sub { my ($x, $y) = @_; $sum_x += $x; $sum_y += $y; $sum_x2 += $x * $x; $sum_xy += $x * $y; $n++; }, # This is the anonymous function for # computing the coefficients of the # least squares fit calcCoefs => sub { my $a = ( $sum_y * $sum_x2 - $sum_x * $sum_xy ) / ( $n * $sum_x2 - $sum_x * $sum_x ); my $b = ( $n * $sum_xy - $sum_x * $sum_y ) / ( $n * $sum_x2 - $sum_x * $sum_x ); return ($a, $b); } } }

    When given the following input

    -2 -3 -2  3
    -1 -1 -1  1
     0  1  0 -1
     1  3  1 -3
     2  5  2 -5
     3  7  3 -7
    
    this program outputs
    for the first set of data a = 1; b = 2
    for the second set of data a = -1; b = -2
    
    which is exactly correct

    The calling code appears to be quite flexible, and the closure could probably be moved to a module. So it appears that using closures to calculate least squares fits might not be such a bad idea!

    Comments? Questions?

    enjoy

    meta4