Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I am stumped by a question about how to calculate the scaling factor between two data sets coming from two experiments. I wrote a small Perl program to gather the data from Experiment-1 and Experiment-2. The input data fed into both experiments comes from the same source. I need to come up with a scaling factor for the data from Experiment-2, to make it as close as possible to the data in Experiment-1. Suppose this data looks like:
Experiment-1-Data    Experiment-2-Data
 2                    3.3333
 3.23                 1.433
 7                    8.0577
 9                    9.7344
11.3479              13.3377

What I need to do is come up with a number to multiply with the data in the column "Experiment-2-Data", to make it as close as possible to the data in the column "Experiment-1-Data".
Is there some way to do this?
Thanks.
Bakbakallah

Replies are listed 'Best First'.
Re: Stumped with a math question in my Perl program
by roboticus (Chancellor) on Jul 21, 2010 at 04:00 UTC

    I think you ought to search CPAN for code to do a least-squares curve fit. Perhaps Algorithm::CurveFit or Statistics::LineFit. Since you want a simple scale factor, you're looking at fitting a line. So you'd minimize sum(F(k)) where F(k) = (E1(k) - A*E2(k))^2, E1 is your first column, E2 is your second column, and A is your scaling factor.
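    For instance, a sketch of how Algorithm::CurveFit might be applied here, assuming its documented curve_fit() interface and using the sample data from the question (after the fit, the value of 'a' ends up back in the parameters array):

        use strict;
        use warnings;
        use Algorithm::CurveFit;

        my @e1 = (2, 3.23, 7, 9, 11.3479);                 # ydata: what we want to match
        my @e2 = (3.3333, 1.433, 8.0577, 9.7344, 13.3377); # xdata: what gets scaled

        # Fit y = a * x, i.e. E1 = a * E2; 'a' is the scaling factor.
        my @parameters = (
            # name, initial guess, target accuracy
            ['a', 1.0, 0.00001],
        );
        Algorithm::CurveFit->curve_fit(
            formula            => 'a * x',
            params             => \@parameters,
            variable           => 'x',
            xdata              => \@e2,
            ydata              => \@e1,
            maximum_iterations => 100,
        );
        printf "fitted scaling factor: %.4f\n", $parameters[0][1];   # ~0.8743 with this data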

    Disclaimer: My numerical analysis course was over 25 years ago, so my memory may be a little off. You'll want to double-check me, in case I've inadvertently missed (or added) something...

    ...roboticus

      Unfortunately, I don't believe either of those modules will be useful here. We aren't trying to fit a line or curve to be close to a set of (X,Y) points. We are picking the value of a single scalar in order to minimize some (unspecified) distance function. (Yes, there are probably quite a few modules on CPAN that help with exactly this type of problem, but I see little point in resorting to those, as you'll see.)

      Now, "least square" is a reasonable choice for the distance function to use. For a given value of the multiplier, we can calculate the "distance" between an Experiment-1 value and the corresponding scaled Experiment-2 value. You defined this single-pair distance via subtraction. Then we need to aggregate all of the distances to come up with an over-all distance metric in order to score how good of a multiplier we chose. Summing up the squares of the pair-wise distances is a reasonable aggregation approach. Combining subtraction with sum-of-squares gives us the "least squares" metric. One reason least squares is popular is because it can lead to some fairly simple calculations to find the (aggregated) minimum when dealing with linear equations.

      For example, if you have points X1..Xn and want to find a single value that is "close to" each of X1..Xn in aggregate and you define this aggregate distance as "least square", then the closest fit turns out to be the average, sum(Xi)/n.

      Now, many of us are familiar with data sets where "average" is not the best aggregate fit metric. People rarely talk about "average home prices" because somebody selling a $16e6 mansion can skew the average too much. So you will usually hear about "median home prices".

      So, one alternate approach would be to minimize the median distance here. That boils down to picking a multiplier such that roughly half of the scaled numbers are too high and roughly half are too low. That'd probably be a fine route to go. And it'd be easy code to write with a simple binary search on the multiplier or by just sorting the pair-wise ratios.
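      (As an illustrative sketch of the sort-the-ratios route, using the sample data from the question; with an odd number of pairs the middle ratio is the median:)

          use strict;
          use warnings;

          my @e1 = (2, 3.23, 7, 9, 11.3479);
          my @e2 = (3.3333, 1.433, 8.0577, 9.7344, 13.3377);

          # Sort the pair-wise ratios E1/E2 and take the middle one.
          my @ratios = sort { $a <=> $b } map { $e1[$_] / $e2[$_] } 0 .. $#e1;
          my $median = $ratios[ int( $#ratios / 2 ) ];   # crude: for an even count,
                                                         # average the middle two instead
          print "median scaling factor: $median\n";      # ~0.8687 here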

      But there is another aggregate distance function that I don't see used much but that makes sense to me for a lot of situations. The variation in home prices is more geometric than linear. You'll hear about home prices "going up 14%" not "going up $10,000". So instead of defining pair-wise distance via subtraction, you can define pair-wise distance via division. Two numbers are as close as possible when their ratio is exactly 1.

      Then you aggregate the ratios using "geometric mean". Instead of adding up all of the numbers and then dividing by however many numbers you have, you multiply all of the numbers together and then take the Nth root (if you have N numbers).

      Since with this specific problem we already have a list of ratios, I'd probably start with "geometric mean" as my distance function and see how well that works.

      Rather than multiplying and taking big roots, it can be more convenient to transform all of the ratios via log(), take the mean (regular average) of those values, and apply exp() to that mean to get the geometric mean of our original values.

      That would be trivial to calculate with about 5 lines of code, here. (No, I don't plan to write any code in this thread except in response to the OP showing some code s/he is having problems with.)
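      (For reference, a minimal sketch of that log()/exp() calculation, using the sample data from the question; the variable names are illustrative:)

          use strict;
          use warnings;

          my @e1 = (2, 3.23, 7, 9, 11.3479);
          my @e2 = (3.3333, 1.433, 8.0577, 9.7344, 13.3377);

          # Average the logs of the pair-wise ratios E1/E2 ...
          my $logsum = 0;
          $logsum += log( $e1[$_] / $e2[$_] ) for 0 .. $#e1;

          # ... then exp() of that mean is the geometric mean of the ratios.
          my $factor = exp( $logsum / @e1 );
          print "geometric-mean scaling factor: $factor\n";   # ~0.984 here; the
                                                              # 3.23/1.433 outlier pulls it up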

      - tye        

        tye:

        Yeah, I knew you weren't trying to find the curve for the sequences. The formula I gave was for minimizing the sum of the squared differences between series 1 and A * series 2. That's so you could find A, the scaling factor you were looking for.

        Having said that, however, when I tried to code it up, I found that I couldn't figure out how to express it as a curve fit problem. So instead I hacked up an iterative approach:

        #!/usr/bin/perl -w
        use strict;
        use warnings;

        my @E1 = (2, 3.23, 7, 9, 11.3479);
        my @E2 = (3.3333, 1.433, 8.0577, 9.7344, 13.3377);

        my ($Alow, $Ahi, $Astep) =
            # (.5, 5, .1);
            # (0.8, 1.0, 0.025);
            # (0.85, 0.9, 0.005);
            (0.87, 0.88, 0.001);

        for (my $A = $Alow; $A <= $Ahi; $A += $Astep) {
            printf "%7.4f %5.3f\n", $A, current_error($A);
        }

        # Sum of squared differences between E1 and A*E2 for a given A.
        sub current_error {
            my $A = shift;
            my $err = 0;
            for (my $i = 0; $i < @E1; ++$i) {
                my $t = $E1[$i] - $A * $E2[$i];
                $err += $t * $t;
            }
            return $err;
        }

        And the last run gave me:

        Roboticus@Roboticus-PC ~
        $ ./curvefit.pl
        0.8700 5.091
        0.8710 5.088
        0.8720 5.086
        0.8730 5.085
        0.8740 5.084
        0.8750 5.085
        0.8760 5.085
        0.8770 5.087
        0.8780 5.089
        0.8790 5.092
        0.8800 5.096

        So it looks like your multiplier is going to be roughly 0.874. If you need to consider a constant offset, you'll (obviously) have to nest in another loop to vary the constant as well. Since you didn't indicate the need for one, I didn't bother.
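        (As an aside, this particular least-squares problem also has a closed form: setting the derivative of the error with respect to A to zero gives A = sum(E1*E2) / sum(E2^2), which agrees with the scan above. A sketch:)

            use strict;
            use warnings;

            my @E1 = (2, 3.23, 7, 9, 11.3479);
            my @E2 = (3.3333, 1.433, 8.0577, 9.7344, 13.3377);

            # A = sum(E1[i]*E2[i]) / sum(E2[i]**2) minimizes
            # sum((E1[i] - A*E2[i])**2) exactly, no scanning needed.
            my ($num, $den) = (0, 0);
            for my $i (0 .. $#E1) {
                $num += $E1[$i] * $E2[$i];
                $den += $E2[$i] * $E2[$i];
            }
            printf "least-squares scaling factor: %.4f\n", $num / $den;   # ~0.8743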

        ...roboticus

Re: Stumped with a math question in my Perl program
by salva (Canon) on Jul 21, 2010 at 07:12 UTC
    I need to come up with a scaling factor for the data from Experiment-2, to make it as close as possible to the data in Experiment-1

    "as close as possible" can have several different interpretations. From a mathematical stand point, you have to define an error function and find the scaling factor that minimizes it.

    Besides that, there is an obvious solution:

    use strict;
    use warnings;
    use feature 'say';   # needed for say()

    my @e1 = (2, 3.23, 7, 9, 11.3479);
    my @e2 = (3.3333, 1.433, 8.0577, 9.7344, 13.3377);

    # Average of the pair-wise ratios e2/e1.
    my $sum = 0;
    $sum += $e2[$_] / $e1[$_] for 0 .. $#e1;
    my $f = $sum / @e1;
    say "average scaling factor: $f";
    That will minimize the sum of squares of the relative error (e2-e1*f)/e1. Note that this $f scales the Experiment-1 data toward Experiment-2; to get the multiplier the OP asked for (scaling Experiment-2 toward Experiment-1), average the e1/e2 ratios instead.
Re: Stumped with a math question in my Perl program
by ww (Archbishop) on Jul 21, 2010 at 03:48 UTC
    Yes.

    We call it "division" (or "long division").

    Divide the value of experiment-1-data by the value of experiment-2-data to ascertain the multiplier.

    Implementation in Perl is left as an exercise for the OP.

    Hint: see On asking for help and How do I post a question effectively?.

    Short form: Show some effort. This is not code-a-matic.

Re: Stumped with a math question in my Perl program
by suhailck (Friar) on Jul 21, 2010 at 03:45 UTC
    If I understood your question correctly,

    perl -ne '@arr = split; printf "%-20s\t%-20s\t%-20s\n", $arr[0], $arr[1], ($. == 1) ? "Scaling factor" : $arr[0]/$arr[1]' filename