Win has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,


I'm concerned that I'm going to get myself in a pickle with this. Please can someone point me in the right direction? I want to get the sub to return distance from the trend line (by y axis) as well as whether each point is above or below the trend line.
sub correlation {
    my ($x_ref, $y_ref) = @_;
    my @array_of_x_values = @$x_ref;
    my @array_of_y_values = @$y_ref;
    my $output = "output_test.txt";
    open (OUTPUT_TEST, ">$output");
    print OUTPUT_TEST Dumper ($x_ref, $y_ref);
    #print Dumper length($x_ref);
    #print Dumper length($y_ref);
    my $lfit = Statistics::LineFit->new();
    $lfit->setData($x_ref, $y_ref);
    my ($intercept, $slope) = $lfit->coefficients();
    my $rSquared = $lfit->rSquared();
    my $hashref = { a => $intercept, b => $slope, c => $rSquared };
    my $number_of_elements_in_array = -1;
    my @x_values_for_points_above_the_trend_line;
    my @y_values_for_points_above_the_trend_line;
    my @x_values_for_points_below_the_trend_line;
    my @y_values_for_points_below_the_trend_line;
    my @x_values_for_points_on_the_trend_line;
    my @y_values_for_points_on_the_trend_line;
    foreach (@array_of_x_values){
        #chomp;
        $number_of_elements_in_array++;
        my $x_point = $_;
        # print "\n$array_of_y_values[$number_of_elements_in_array] > $intercept + ($x_point * $slope)\n\n";
        if (defined($x_point)){
            if ($array_of_y_values[$number_of_elements_in_array] > $intercept + ($x_point * $slope)){
                #points are above the trend line
                #### my $Distance_above_trend_line =
                push (@x_values_for_points_above_the_trend_line, $x_point);
                push (@y_values_for_points_above_the_trend_line, $array_of_y_values[$number_of_elements_in_array]);
            }
            elsif ($array_of_y_values[$number_of_elements_in_array] < $intercept + ($x_point * $slope)){
                #points are below the trend line
                # $y_values_for_points_below_the_trend_line =
                push (@x_values_for_points_below_the_trend_line, $x_point);
                push (@y_values_for_points_below_the_trend_line, $array_of_y_values[$number_of_elements_in_array]);
            }
            else {
                push (@x_values_for_points_on_the_trend_line, $x_point);
                push (@y_values_for_points_on_the_trend_line, $array_of_y_values[$number_of_elements_in_array]);
            }
        }
    }
    return $hashref,
        \@x_values_for_points_above_the_trend_line,
        \@y_values_for_points_above_the_trend_line,
        \@x_values_for_points_below_the_trend_line,
        \@y_values_for_points_below_the_trend_line,
        \@x_values_for_points_on_the_trend_line,
        \@y_values_for_points_on_the_trend_line;
}

Replies are listed 'Best First'.
Re: returning distance from trendline
by jdporter (Paladin) on Jan 14, 2008 at 16:39 UTC
    #print Dumper length($x_ref);
    #print Dumper length($y_ref);
    

    I think it's pretty sad that at this point you still don't know how to find the length of an array. Anyway... Here's your function modified for the info you wanted. But personally I find the idea of having two parallel arrays, one for X and one for Y, to be inelegant. I'd rather have a single array containing "points", where each point would be [ x, y ]. In fact, additional info could be added to each "point", e.g. [ x, y, distance_from_line ], etc. Then you could think about making a Point object class. But maybe I'm going too fast for you...

    sub correlation {
        my( $x_ref, $y_ref ) = @_;
        my @x_values = @$x_ref;
        my @y_values = @$y_ref;

        # Dump the input data to a file for inspection.
        my $output = "output_test.txt";
        open OUTPUT_TEST, ">$output" or die "can't write $output - $!";
        print OUTPUT_TEST Dumper ($x_ref, $y_ref);
        close OUTPUT_TEST;

        my $lfit = Statistics::LineFit->new();
        $lfit->setData( $x_ref, $y_ref );
        my( $intercept, $slope ) = $lfit->coefficients();
        my $rSquared = $lfit->rSquared();
        my $hashref = { a => $intercept, b => $slope, c => $rSquared };

        my(
            @x_values_for_points_above_the_trend_line,
            @y_values_for_points_above_the_trend_line,
            @distances_above_the_trend_line,
            @x_values_for_points_below_the_trend_line,
            @y_values_for_points_below_the_trend_line,
            @distances_below_the_trend_line,
            @x_values_for_points_on_the_trend_line,
            @y_values_for_points_on_the_trend_line,
        );

        for ( 0 .. $#x_values ) {
            my( $x, $y ) = ( $x_values[$_], $y_values[$_] );
            my $fx = $intercept + ( $x * $slope );   # y value of the trend line at x
            if ( $y > $fx ) {
                push @x_values_for_points_above_the_trend_line, $x;
                push @y_values_for_points_above_the_trend_line, $y;
                push @distances_above_the_trend_line, $y - $fx;
            }
            elsif ( $y < $fx ) {
                push @x_values_for_points_below_the_trend_line, $x;
                push @y_values_for_points_below_the_trend_line, $y;
                push @distances_below_the_trend_line, $fx - $y;
            }
            else {
                push @x_values_for_points_on_the_trend_line, $x;
                push @y_values_for_points_on_the_trend_line, $y;
            }
        }

        return(
            $hashref,
            \@x_values_for_points_above_the_trend_line,
            \@y_values_for_points_above_the_trend_line,
            \@distances_above_the_trend_line,
            \@x_values_for_points_below_the_trend_line,
            \@y_values_for_points_below_the_trend_line,
            \@distances_below_the_trend_line,
            \@x_values_for_points_on_the_trend_line,
            \@y_values_for_points_on_the_trend_line,
        );
    }

    (Disclaimer: Code Untested.)
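
For what it's worth, the single-array-of-points idea mentioned above could look roughly like the following. This is only a sketch: annotate_points is a made-up name, the data is invented, and the intercept and slope are hard-coded stand-ins for whatever $lfit->coefficients() would return.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a reference to a list of [x, y] points plus
# the fitted intercept and slope, and returns new points of the form
# [x, y, signed_distance], where signed_distance = y - (intercept + slope * x).
sub annotate_points {
    my ($points, $intercept, $slope) = @_;
    return map {
        my ($x, $y) = @$_;
        [ $x, $y, $y - ($intercept + $slope * $x) ];
    } @$points;
}

# Made-up data, with the line y = 1 + 2x standing in for a real fit.
my @points    = ( [0, 1.5], [1, 3], [2, 4] );
my @annotated = annotate_points(\@points, 1, 2);

for my $p (@annotated) {
    my ($x, $y, $d) = @$p;
    # The sign of the distance says which side of the line the point is on.
    my $side = $d > 0 ? 'above' : $d < 0 ? 'below' : 'on';
    print "($x, $y) is $side the line, distance ", abs($d), "\n";
}
```

Keeping the sign on the distance means "above or below" and "how far" travel together in one value, so the six or eight parallel arrays are no longer needed.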

    A word spoken in Mind will reach its own level, in the objective world, by its own weight
Re: returning distance from trendline
by almut (Canon) on Jan 14, 2008 at 18:56 UTC

    The values you want to compute are called residuals, and incidentally, Statistics::LineFit does offer a method to compute them, e.g. my @resid = $lfit->residuals();  (The general idea is that you have some model (the trend line equation, in this case), which explains/predicts some of the variation found in the data. The deviations from the model are called residuals.)

    OTOH, in order to get a better understanding of what's going on, there's nothing wrong with writing the code yourself, in particular as it's rather straightforward. Similarly, if you were to compute residuals for other points (i.e. points which were not used to compute the parameters of the trend equation), you'd have to code it yourself anyway (Statistics::LineFit's residuals method doesn't do the latter).
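
As a rough illustration of "writing the code yourself", here is a sketch of a textbook least-squares fit plus residuals for points that were not part of the fit. The names fit_line and residuals_for are made up for this example, the data is invented, and this is not a description of how Statistics::LineFit works internally.

```perl
use strict;
use warnings;

# Textbook least-squares fit of y = intercept + slope * x.
# Hypothetical helper, not part of Statistics::LineFit.
sub fit_line {
    my ($xs, $ys) = @_;
    my $n = @$xs;
    die "need at least two points and equally long x and y lists\n"
        unless $n >= 2 && $n == @$ys;

    my ($sum_x, $sum_y) = (0, 0);
    $sum_x += $_ for @$xs;
    $sum_y += $_ for @$ys;
    my ($mean_x, $mean_y) = ($sum_x / $n, $sum_y / $n);

    # Sums of squared x deviations and of x*y cross deviations.
    my ($sxx, $sxy) = (0, 0);
    for my $i (0 .. $n - 1) {
        my $dx = $xs->[$i] - $mean_x;
        $sxx += $dx * $dx;
        $sxy += $dx * ($ys->[$i] - $mean_y);
    }
    die "all x values are identical\n" if $sxx == 0;

    my $slope     = $sxy / $sxx;
    my $intercept = $mean_y - $slope * $mean_x;
    return ($intercept, $slope);
}

# Residuals (observed y minus predicted y) for any points, fitted or not.
sub residuals_for {
    my ($intercept, $slope, $xs, $ys) = @_;
    return map { $ys->[$_] - ($intercept + $slope * $xs->[$_]) } 0 .. $#$xs;
}

# Made-up data used to fit the line.
my @fit_x = (1, 2, 3, 4);
my @fit_y = (2, 4, 5, 8);
my ($intercept, $slope) = fit_line(\@fit_x, \@fit_y);

# Points that were NOT used for the fit.
my @new_x = (5, 6);
my @new_y = (10, 9);
my @resid = residuals_for($intercept, $slope, \@new_x, \@new_y);
printf "x=%g resid=%+.3f\n", $new_x[$_], $resid[$_] for 0 .. $#new_x;
```

A positive residual puts the point above the line, a negative one below it.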

    In your case, positive residuals are above, and negative residuals below the trend line. What's "on" the line is a matter of definition. Generally, in mathematics, a line is infinitesimally thin, so practically nothing ever really lies "on" it (and a direct comparison of floats would be subject to round-off errors)... but you can specify a certain +/- range around it, and check whether a certain value lies within that range.

    Here's what the check could look like, essentially (I'll leave it to you to integrate that into your routine):

    my $delta = 0.01;  # choose some sensible value (defines range)

    for my $i (0 .. $#array_of_x_values) {
        my ($x, $y) = ($array_of_x_values[$i], $array_of_y_values[$i]);
        my $y_trend = $intercept + ($x * $slope);
        my $resid = $y - $y_trend;
        if ($resid > -$delta and $resid < $delta) {
            # points are "on" (i.e. within some small range around) trend line
            # ...
        } else {
            if ($resid > 0) {
                # points are above the trend line
                # ...
            } else {
                # points are below the trend line
                # ...
            }
        }
    }
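
One possible way to fold that check into a routine is sketched below. The name classify_points and the shape of the return value are assumptions made for this example, not anything from the original code or from Statistics::LineFit, and the data is invented.

```perl
use strict;
use warnings;

# Hypothetical routine: sorts points into "above", "below" and "on"
# buckets. Each bucket entry is [x, y, residual], with
# residual = y - (intercept + slope * x). $delta is the +/- tolerance
# within which a point counts as being "on" the line.
sub classify_points {
    my ($xs, $ys, $intercept, $slope, $delta) = @_;
    my %bucket = (above => [], below => [], on => []);

    for my $i (0 .. $#$xs) {
        my ($x, $y) = ($xs->[$i], $ys->[$i]);
        next unless defined $x && defined $y;   # skip missing values

        my $resid = $y - ($intercept + $x * $slope);
        my $where = abs($resid) < $delta ? 'on'
                  : $resid > 0           ? 'above'
                  :                        'below';
        push @{ $bucket{$where} }, [ $x, $y, $resid ];
    }
    return \%bucket;
}

# Made-up data; the line y = 1 + 2x stands in for a real fit.
my @x = (0, 1, 2, 3);
my @y = (1.005, 3.5, 4.2, 7);
my $result = classify_points(\@x, \@y, 1, 2, 0.01);

for my $where (qw(above below on)) {
    printf "%-5s (%d, %g, %+.3f)\n", $where, @$_ for @{ $result->{$where} };
}
```

The absolute value of the stored residual is the vertical distance from the trend line, so the caller gets both the side and the distance from one structure.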
Re: returning distance from trendline
by apl (Monsignor) on Jan 14, 2008 at 16:38 UTC
    CPAN points to Statistics::PointEstimation. Take a look at the documentation and see if it meets your needs.

    I haven't used it, and hope I'm not pointing you in the wrong direction...