Re^2: Stumped with a math question in my Perl program (log scale)

Unfortunately, I don't believe either of those modules will be useful here. We aren't trying to fit a line or curve to be close to a set of (X,Y) points. We are picking the value of a single scalar in order to minimize some (unspecified) distance function. (Yes, there are probably quite a few modules on CPAN that help with exactly this type of problem, but I see little point in resorting to those, as you'll see.)

Now, "least squares" is a reasonable choice for the distance function to use. For a given value of the multiplier, we can calculate the "distance" between an Experiment-1 value and the corresponding scaled Experiment-2 value. You defined this single-pair distance via subtraction. Then we need to aggregate all of the distances to come up with an overall distance metric in order to score how good a multiplier we chose. Summing up the squares of the pair-wise distances is a reasonable aggregation approach. Combining subtraction with sum-of-squares gives us the "least squares" metric. One reason least squares is popular is that it can lead to some fairly simple calculations to find the (aggregated) minimum when dealing with linear equations.

For example, if you have points X1..Xn and want to find a single value that is "close to" each of X1..Xn in aggregate and you define this aggregate distance as "least squares", then the closest fit turns out to be the average, sum(Xi)/n.
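
To see why the average falls out, here is a short derivation sketch. The symbols S and c are introduced only for illustration (c is the single value being chosen, S is the sum of squared differences); setting the derivative of S to zero gives the mean:

```latex
S(c) = \sum_{i=1}^{n} (X_i - c)^2, \qquad
\frac{dS}{dc} = -2 \sum_{i=1}^{n} (X_i - c) = 0
\;\Longrightarrow\;
c = \frac{1}{n} \sum_{i=1}^{n} X_i
```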

Now, many of us are familiar with data sets where "average" is not the best aggregate fit metric. People rarely talk about "average home prices" because somebody selling a $16e6 mansion can skew the average too much. So you will usually hear about "median home prices".

So, one alternate approach would be to minimize the median distance here. That boils down to picking a multiplier such that roughly half of the scaled numbers are too high and roughly half are too low. That'd probably be a fine route to go. And it'd be easy code to write with a simple binary search on the multiplier or by just sorting the pair-wise ratios and taking the middle one.
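
A minimal sketch of the sort-the-ratios route, assuming the multiplier is applied to the Experiment-2 values so that each pair-wise ratio is E1/E2. The arrays `@e1` and `@e2` and the helper `median` are hypothetical names with made-up data, not anything from the original question:

```perl
use strict;
use warnings;

# Hypothetical sample data (not from the original question).
my @e1 = (2, 6, 12, 20, 33);
my @e2 = (1, 2, 3, 4, 5);

# Median of a list of numbers; averages the two middle values
# when the list has an even number of elements.
sub median {
    my @sorted = sort { $a <=> $b } @_;
    my $mid    = int(@sorted / 2);
    return @sorted % 2
        ? $sorted[$mid]
        : ($sorted[$mid - 1] + $sorted[$mid]) / 2;
}

# Each ratio is the multiplier that would make that one pair match exactly.
my @ratios     = map { $e1[$_] / $e2[$_] } 0 .. $#e1;
my $multiplier = median(@ratios);

printf "median multiplier: %.4f\n", $multiplier;
```

With this made-up data the ratios are 2, 3, 4, 5 and 6.6, so the median multiplier is 4: two scaled values come out too high, two too low, and one matches.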

But there is another aggregate distance function that I don't see used much but that makes sense to me for a lot of situations. The variation in home prices is more geometric than linear. You'll hear about home prices "going up 14%" not "going up $10,000". So instead of defining pair-wise distance via subtraction, you can define pair-wise distance via division. Two numbers are as close as possible when their ratio is exactly 1.

Then you aggregate the ratios using "geometric mean". Instead of adding up all of the numbers and then dividing by however many numbers you have, you multiply all of the numbers together and then take the Nth root (if you have N numbers).

Since with this specific problem we already have a list of ratios, I'd probably start with "geometric mean" as my distance function and see how well that works.

Rather than multiplying and taking big roots, it can be more convenient to just transform all of the ratios via log(), take the mean (regular average) of those values, and just apply exp() to that mean to get the geometric mean of our original values.
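
A minimal sketch of that log / mean / exp route, under stated assumptions: the ratios below are made up, `geometric_mean` is a hypothetical helper name, and every ratio must be positive (log() of zero or a negative number is an error in Perl):

```perl
use strict;
use warnings;

# Hypothetical pair-wise ratios (not from the original question).
my @ratios = (0.5, 1, 2, 4, 8);

# Geometric mean: exp() of the ordinary average of the logs.
# Assumes a non-empty list of strictly positive numbers.
sub geometric_mean {
    my @values  = @_;
    my $log_sum = 0;
    $log_sum += log($_) for @values;
    return exp($log_sum / @values);
}

printf "geometric mean: %.4f\n", geometric_mean(@ratios);
```

For these five ratios the product is 32, so the result is the fifth root of 32, which is 2.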

That would be trivial to calculate with about 5 lines of code, here. (No, I don't plan to write any code in this thread except in response to the OP showing some code s/he is having problems with.)

- tye


Replies are listed 'Best First'.

Re^3: Stumped with a math question in my Perl program (log scale)
by roboticus (Chancellor) on Jul 21, 2010 at 14:53 UTC

tye:

Yeah, I knew you weren't trying to find the curve for the sequences. The formula I gave was for minimizing the sum of the squares of the differences between series 1 and A * series 2. That's so you could find A, the scaling factor you were looking for.

Having said that, however, when I tried to code it up, I found that I couldn't figure out how to express it as a curve fit problem. So instead I hacked up an iterative approach:

#!/usr/bin/perl -w
use strict;
use warnings;

my @E1 = (2, 3.23, 7, 9, 11.3479);
my @E2 = (3.3333, 1.433, 8.0577, 9.7344, 13.3377);

my ($Alow, $Ahi, $Astep) =
#       (.5, 5, .1);
#       (0.8, 1.0, 0.025);
#       (0.85, 0.9, 0.005);
        (0.87, 0.88, 0.001);

for (my $A=$Alow; $A <= $Ahi; $A+=$Astep) {
        printf "%7.4f %5.3f\n", $A, current_error($A);
}

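# Sum of squared differences between E1 and A*E2 for a given multiplier A.
sub current_error {
        my $A = shift;
        my $err=0;
        for (my $i=0; $i<@E1; ++$i) {
                # Residual for this pair after scaling E2 by A
                my $t = $E1[$i] - $A*$E2[$i];
                $err += $t*$t;
        }
        return $err;
}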

And the last run gave me:

Roboticus@Roboticus-PC ~
$ ./curvefit.pl
 0.8700 5.091
 0.8710 5.088
 0.8720 5.086
 0.8730 5.085
 0.8740 5.084
 0.8750 5.085
 0.8760 5.085
 0.8770 5.087
 0.8780 5.089
 0.8790 5.092
 0.8800 5.096

So it looks like your multiplier is going to be roughly 0.874. If you need to consider a constant offset, you'll (obviously) have to nest in another loop to vary the constant as well. Since you didn't indicate the need for one, I didn't bother.
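
As a cross-check on the grid search, this one-parameter least-squares problem also has a closed form: setting the derivative of sum((E1[i] - A*E2[i])**2) with respect to A to zero gives A = sum(E1[i]*E2[i]) / sum(E2[i]**2). A minimal sketch using the same two arrays as the script above; the helper name `least_squares_multiplier` is made up for illustration:

```perl
use strict;
use warnings;

# Same data as the iterative script above.
my @E1 = (2, 3.23, 7, 9, 11.3479);
my @E2 = (3.3333, 1.433, 8.0577, 9.7344, 13.3377);

# Closed-form minimizer of sum( ($x->[i] - A * $y->[i])**2 ) over A.
# Takes two array refs of equal length: the target series and the
# series to be scaled. Assumes the second series is not all zeros.
sub least_squares_multiplier {
    my ($x, $y) = @_;
    my ($num, $den) = (0, 0);
    for my $i (0 .. $#$x) {
        $num += $x->[$i] * $y->[$i];
        $den += $y->[$i] ** 2;
    }
    return $num / $den;
}

printf "A = %.4f\n", least_squares_multiplier(\@E1, \@E2);
```

This should land on roughly the same multiplier as the table above, without having to narrow the search range by hand.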

...roboticus
