BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

In Re: Data Normalization, lonewolf28 asked about normalising some data, and I posted a couple of subroutines that do linear and log scaling.

Thinking about this some more, I wondered whether there is any way to decide programmatically, from a given set of data, the most appropriate method of scaling to use.

Often a human being can do this by inspection. E.g. this is fairly obviously linear:

5 5 34 44 114 169 177 184 270 339 361 364 442 511 530 554 555 587 709 709 735 778 791 859 871 899 903 926 933 952

This is log2:

0, 1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191, 16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151, 4194303, 8388607, 16777215, 33554431, 67108863, 134217727, 268435455, 536870911, 1073741823

And this is log10:

1.713125e-005, 1.748086e-006, 2.101463e-006, 1.977405e-006, 3.597675e-006, 3.725492e-006, 3.924736e-006, 2.902199e-006, 3.988645e-006, 8.210367e-006, 3.360837e-006, 5.202907e-006, 7.082570e-006, 8.778026e-006, 7.079562e-005, 9.100576e-005, 5.258545e-005, 9.292677e-005, 1.789815e-004, 2.113948e-003, 7.229146e-004, 1.428995e-003, 2.742045e-003, 5.552746e-003, 1.822390e-002, 2.220999e-002, 4.316067e-002, 8.876963e-002, 1.751072e-001, 3.494051e-001, 7.155960e-001, 1.347822e+000

But how could a program decide that?


Re: Data range detection?
by hdb (Monsignor) on Apr 13, 2015 at 07:33 UTC

    You do a scatterplot of the data as is, on a log scale, and on a log-log scale (potentially other transformations too), then pick the one that looks "most linear", i.e. the one with the best fit after a linear regression.

      I was really looking for an automated method that would give a 'reasonable result' for most inputs, without human intervention.

      Maybe it's not possible, but it is worth asking. How to program "looks right"?



        I may be thinking of this too simplistically, but could you not do something along the lines of:

        For each type of proposed pattern,

        • Specify functions Fx and Fy as the functions to apply to the x and y values, respectively.
        • Given that for a straight line, (y2 - y1) = m(x2 - x1), use the equation of a line in the form ( Fy(y2) - Fy(y1) ) = m ( Fx(x2) - Fx(x1) )
        • Use the first and last data values to compute a value for m.
        • Compute the estimated values for the interior data points.
        • Compute the error of those estimates.
        • Select the proposed pattern that yields the smallest error (see the sketch below).
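
        A minimal Perl sketch of that recipe. The x values are assumed to be just the sample numbers 1..N (the OP's series have no explicit x values), and the candidate transforms and the error normalisation are my own choices for illustration:

            use strict;
            use warnings;

            # Candidate (Fx, Fy) pairs. These particular transforms are only
            # examples; add others (sqrt, squares, ...) in the same way.
            my %candidate = (
                linear => [ sub { $_[0] },     sub { $_[0] } ],
                logy   => [ sub { $_[0] },     sub { log $_[0] } ],
                loglog => [ sub { log $_[0] }, sub { log $_[0] } ],
            );

            sub best_pattern {
                my @y = @_;
                my @x = ( 1 .. @y );    # assume x is just the sample number
                my ( $best, $best_err );

                for my $name ( sort keys %candidate ) {
                    my ( $fx, $fy ) = @{ $candidate{$name} };

                    # log is undefined for values <= 0; skip those candidates.
                    next if $name =~ /log/ and grep { $_ <= 0 } @y;

                    # Slope m from the first and last points only.
                    my ( $X1, $X2 ) = ( $fx->( $x[0] ), $fx->( $x[-1] ) );
                    my ( $Y1, $Y2 ) = ( $fy->( $y[0] ), $fy->( $y[-1] ) );
                    my $m = ( $Y2 - $Y1 ) / ( $X2 - $X1 );

                    # Squared error of the estimates at the interior points,
                    # normalised by the transformed range so that errors
                    # measured in different spaces are roughly comparable.
                    my $err = 0;
                    for my $i ( 1 .. $#x - 1 ) {
                        my $est = $Y1 + $m * ( $fx->( $x[$i] ) - $X1 );
                        $err += ( $fy->( $y[$i] ) - $est ) ** 2;
                    }
                    $err /= ( $Y2 - $Y1 ) ** 2 if $Y2 != $Y1;

                    ( $best, $best_err ) = ( $name, $err )
                        if !defined $best_err or $err < $best_err;
                }
                return $best;
            }

        For the OP's first series this should come out 'linear'; for the powers-of-two series (less its leading zero, which the log candidates cannot handle) it should come out 'logy'.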

        Is this a realistic idea, or am I thinking too simply?

        Hope that helps.

        There are various options for measuring the "goodness of fit", and which criterion is best is itself much debated. You could, for example, look at the R^2 of a linear fit and choose the scale with the largest one.
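
        A minimal sketch of that criterion in Perl, hand-rolling the regression rather than pulling in a CPAN module, and again assuming the x values are just the sample numbers:

            use strict;
            use warnings;
            use List::Util qw( sum );

            # R^2 of an ordinary least-squares line through (x, y):
            # the squared correlation coefficient.
            sub r_squared {
                my ( $x, $y ) = @_;
                my $n  = @$x;
                my $mx = sum( @$x ) / $n;
                my $my = sum( @$y ) / $n;
                my ( $sxy, $sxx, $syy ) = ( 0, 0, 0 );
                for my $i ( 0 .. $n - 1 ) {
                    $sxy += ( $x->[$i] - $mx ) * ( $y->[$i] - $my );
                    $sxx += ( $x->[$i] - $mx ) ** 2;
                    $syy += ( $y->[$i] - $my ) ** 2;
                }
                return $syy ? $sxy ** 2 / ( $sxx * $syy ) : 0;
            }

            # Fit the raw series and its log-transformed version against
            # the index; keep whichever is straighter.
            sub pick_scale {
                my @y = @_;
                my @x = ( 1 .. @y );
                return 'linear' if grep { $_ <= 0 } @y;   # log undefined
                my $r2_lin = r_squared( \@x, \@y );
                my $r2_log = r_squared( \@x, [ map { log } @y ] );
                return $r2_lin >= $r2_log ? 'linear' : 'log';
            }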

Re: Data range detection?
by QM (Parson) on Apr 13, 2015 at 07:45 UTC
    Expanding on hdb's idea, you could do what I did on a "programming lab" assignment for Analog Circuits: Keep the curve the same, and change the scales to suit.

    Granted, there were only 3 types of graphs to draw (overdamped, underdamped, and critically damped), but it was straightforward to determine these. Given the right choice, the time and amplitude axes were scaled to suit.

    In the OP, you may be able to set up an equation and solve for the kind of curve. You can do what they do when developing statistical models, which is to throw every kind of term into the model and see which terms correlate. There are variations where one term at a time is dropped and the regression rerun; when only one term is left, that is usually the best one. (Though sometimes a polynomial will appear linear and such, but then, it was nearly linear to begin with.)
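
    A drastically simplified, single-term sketch of that idea in Perl: rank some guessed-at candidate terms (transforms of the index) by how strongly each correlates with the data, and keep the strongest. Full stepwise regression would fit all the terms together and drop them one at a time; this one-variable stand-in is only meant to show the shape of the approach.

        use strict;
        use warnings;
        use List::Util qw( sum );

        # Pearson correlation of two equal-length arrays.
        sub correlation {
            my ( $x, $y ) = @_;
            my $n  = @$x;
            my $mx = sum( @$x ) / $n;
            my $my = sum( @$y ) / $n;
            my ( $sxy, $sxx, $syy ) = ( 0, 0, 0 );
            for my $i ( 0 .. $n - 1 ) {
                $sxy += ( $x->[$i] - $mx ) * ( $y->[$i] - $my );
                $sxx += ( $x->[$i] - $mx ) ** 2;
                $syy += ( $y->[$i] - $my ) ** 2;
            }
            return $sxx * $syy ? $sxy / sqrt( $sxx * $syy ) : 0;
        }

        # Candidate terms, i.e. transforms of the index i. Which terms
        # to try is guesswork; these are just plausible suspects.
        my %term = (
            'i'      => sub { $_[0] },
            'i**2'   => sub { $_[0] ** 2 },
            '2**i'   => sub { 2 ** $_[0] },
            'log(i)' => sub { log $_[0] },
        );

        sub strongest_term {
            my @y = @_;
            my ( $best, $best_r ) = ( undef, -1 );
            for my $name ( sort keys %term ) {
                my @t = map { $term{$name}->( $_ ) } 1 .. @y;
                my $r = abs( correlation( \@t, \@y ) );
                ( $best, $best_r ) = ( $name, $r ) if $r > $best_r;
            }
            return $best;
        }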


      Granted, there were only 3 types of graphs to draw (overdamped, underdamped, and critically damped), but it was straightforward to determine these.

      I think what I take from that is: if you know what the data represents, you (the human) can pre-select a set of possible scalings and programmatically choose the best one.

      But that doesn't really help for the general case where the data the program gets could be anything?


Re: Data range detection?
by RichardK (Parson) on Apr 13, 2015 at 12:06 UTC

    If you look at the median, interquartile range and mean of your data, that should let you decide how it is distributed. Roughly speaking, if the median is close to the mean then the data is fairly linear; but the interquartile range will give you a better view.

    I'm not sure how you could detect the base of log data; it will all look the same, as it's trivial to convert from one base to another.
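
    One way to turn that into code: a crude skew test comparing how far the mean sits from the median, measured in units of the interquartile range. A minimal sketch; the 0.5 threshold below is a guess, not anything principled:

        use strict;
        use warnings;
        use List::Util qw( sum );

        # p-th quantile (0 <= p <= 1) by linear interpolation.
        sub percentile {
            my ( $p, @sorted ) = @_;
            my $idx  = $p * ( @sorted - 1 );
            my $lo   = int $idx;
            my $frac = $idx - $lo;
            my $next = $sorted[ $lo + 1 ] // $sorted[ $lo ];
            return $sorted[ $lo ] + $frac * ( $next - $sorted[ $lo ] );
        }

        # On roughly linear data the mean sits near the median; on
        # log-like data the long right tail drags the mean far above it.
        sub looks_log {
            my @s    = sort { $a <=> $b } @_;
            my $mean = sum( @s ) / @s;
            my $med  = percentile( 0.50, @s );
            my $iqr  = percentile( 0.75, @s ) - percentile( 0.25, @s );
            return 0 unless $iqr;                      # degenerate data
            return abs( $mean - $med ) / $iqr > 0.5;   # threshold: a guess
        }

    On the OP's three series this should flag the two log-like sets, since their means sit several IQRs above their medians.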

      If you look at the median, interquartile range and mean of your data, that should let you decide how it is distributed.

      The problem is that there is no expectation that the input values will, or should, be evenly distributed. I.e. they could be clumped at both ends, predominantly at one end, or any other variation.

      My gut tells me you're pointing me in the right direction, but I'm not seeing how to make use of the suggestion?

      I'm not sure how you could detect the base of log data; it will all look the same, as it's trivial to convert from one base to another.

      Indeed. The log base doesn't matter. The point was only that certain datasets are very obviously logarithmic by (human) inspection.


Re: Data range detection?
by roboticus (Chancellor) on Apr 13, 2015 at 10:42 UTC

    BrowserUk:

    I think I'd try treating the axes separately, and for each, find the scale that would most evenly distribute the data points along the axis.


      I think I'd try treating the axes separately

      I don't see any other way than to treat each set of points independently?

      find the scale that would most evenly distribute the data points along the axis.

      I don't know how to assess "most evenly distributed"?

      The input values may be clumped or unevenly distributed, and whatever scaling you apply, the output values will, mathematically, be proportionally the same.

      I'm just not seeing how to tackle this at all.



        BrowserUk:

        For "evenly distributed", I meant choosing the scale that most evenly spreads out the points; an exponential distribution on a linear axis will bunch everything up to the left, for example. Doing all the work to find out what "evenly distributed" really means would be a headache. I hacked something together this morning that managed to select between linear and logarithmic for the number series you provided: to judge which version was most "evenly distributed", I simply counted the number of points to the left of the midpoint, compared that to half the number of points provided, and selected the scale where the difference was smallest.

        From memory, it went something like:

        use List::Util qw( min max );

        sub check_list {
            my $r = shift;    # array ref of (positive) values
            my ( $min, $max ) = ( min( @$r ), max( @$r ) );

            # Midpoint of the range on a linear scale, and the geometric
            # midpoint for a log scale (log is undefined for values <= 0).
            my $ctr_lin = ( $min + $max ) / 2;
            my $ctr_log = exp( ( log( $min ) + log( $max ) ) / 2 );

            # Count the points falling below each midpoint.
            my ( $cnt_lin, $cnt_log ) = ( 0, 0 );
            for ( @$r ) {
                ++$cnt_lin if $_ < $ctr_lin;
                ++$cnt_log if $_ < $ctr_log;
            }

            # The scale leaving closest to half the points on each side
            # of its midpoint wins.
            my $error_lin = abs( $cnt_lin - @$r / 2 );
            my $error_log = abs( $cnt_log - @$r / 2 );
            return $error_lin < $error_log ? "linear" : "log";
        }
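
        For example, on subsets of the series from the OP (dropping the leading 0 from the log2 series, since log(0) would die):

            print check_list( [ 1, 3, 7, 15, 31, 63, 127, 255 ] ), "\n";      # log
            print check_list( [ 5, 34, 114, 270, 442, 555, 709, 952 ] ), "\n"; # linear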

        Update: I mentioned treating the axes separately because some people mentioned curve fitting (IIRC), which implied (to me) using both axes at the same time.
