in reply to Data range detection?

You do a scatterplot of the data as is, on a log scale, and on a log-log scale (potentially with other transformations too), and then pick the one that looks "most linear", i.e. the one with the best fit after linear regression.

Re^2: Data range detection?
by BrowserUk (Patriarch) on Apr 13, 2015 at 07:39 UTC

    I was really looking for an automated method that would give a 'reasonable result' for most inputs, without human intervention.

    Maybe it's not possible, but it is worth asking. How to program "looks right"?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      I may be thinking of this too simplistically, but could you not do something along the lines of:

      For each type of proposed pattern,

      • Specify functions Fx and Fy as the functions to apply to the x and y values, respectively.
      • Given that for a straight line, (y2 - y1) = m(x2 - x1), use the equation of a line in the form ( Fy(y2) - Fy(y1) ) = m ( Fx(x2) - Fx(x1) )
      • Use the first and last data values to compute a value m
      • Compute the estimated values for the internal data points
      • Compute the error of the estimated values
      • Select proposed pattern based on the one providing the smallest error
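A minimal sketch of the steps above, written in Python since the thread names no language. The pattern names, the helper names (`endpoint_error`, `pick_pattern`) and the four candidate transforms are illustrative assumptions, not something given in the post. One thing is added beyond the listed steps: the error is divided by the squared spread of the transformed y values, because raw errors measured on a linear scale and on a log scale are not directly comparable. The sketch assumes at least three points, positive values wherever a log is taken, and data that is not constant.

```python
import math

def identity(v):
    return v

# Hypothetical candidate patterns: name -> (Fx, Fy).
PATTERNS = {
    "linear": (identity, identity),
    "log-y": (identity, math.log),
    "log-x": (math.log, identity),
    "log-log": (math.log, math.log),
}

def endpoint_error(xs, ys, fx, fy):
    """Normalised mean squared error of the interior points against
    the straight line through the first and last transformed points."""
    tx = [fx(x) for x in xs]
    ty = [fy(y) for y in ys]
    # Slope m from ( Fy(y2) - Fy(y1) ) = m ( Fx(x2) - Fx(x1) ).
    m = (ty[-1] - ty[0]) / (tx[-1] - tx[0])
    # Squared difference between estimated and actual interior values.
    sq = [((ty[0] + m * (u - tx[0])) - v) ** 2
          for u, v in zip(tx[1:-1], ty[1:-1])]
    # Assumption: scale by the spread of ty so patterns can be compared.
    span = max(ty) - min(ty)
    return (sum(sq) / len(sq)) / span ** 2

def pick_pattern(xs, ys):
    """Return the pattern name with the smallest error, plus all errors."""
    errors = {name: endpoint_error(xs, ys, fx, fy)
              for name, (fx, fy) in PATTERNS.items()}
    return min(errors, key=errors.get), errors

xs = [float(i) for i in range(1, 11)]
ys = [3.0 * math.exp(0.5 * x) for x in xs]   # exponential growth
best, errors = pick_pattern(xs, ys)
print(best)
```

Because the line is pinned to the first and last points only, a single noisy endpoint can skew the result; a least-squares line is the usual way around that.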

      Is this a realistic idea, or am I thinking too simply?

      Hope that helps.

        Select proposed pattern based on the one providing the smallest error.... Is this a realistic idea, or am I thinking too simply?

        No. I don't think you are.

        The first thing that popped into my mind when I read your post was: chi-square test.

        Not quite sure how (or which variation) to use it yet, but I think it could be a starting point. Thanks.



      There are various ways to measure the "goodness of fit", and which criterion is best is at times highly debated. You could, for example, look at the R^2 of a linear fit on each scale and choose the scale with the largest one.
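As a rough illustration of the R^2 suggestion, here is a sketch in Python (the thread names no language). The names `r_squared`, `best_scale` and `SCALES` are made up for the example. It relies on the fact that, for a least-squares straight line with one predictor, R^2 equals the squared Pearson correlation of x and y, so no explicit fit is needed. It assumes positive values wherever a log is taken and data that is not constant on any candidate scale.

```python
import math

def r_squared(xs, ys):
    """R^2 of the least-squares line y = a + b*x, computed as the
    squared Pearson correlation (valid for a single predictor)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def identity(v):
    return v

# Hypothetical candidate scales: name -> (transform for x, transform for y).
SCALES = {
    "linear": (identity, identity),
    "log-y": (identity, math.log),
    "log-x": (math.log, identity),
    "log-log": (math.log, math.log),
}

def best_scale(xs, ys):
    """Return the scale whose transformed data has the largest R^2,
    plus the R^2 value for every candidate."""
    scores = {
        name: r_squared([fx(x) for x in xs], [fy(y) for y in ys])
        for name, (fx, fy) in SCALES.items()
    }
    return max(scores, key=scores.get), scores

xs = [float(i) for i in range(1, 11)]
ys = [2.0 * x ** 3 for x in xs]   # power law, straight on log-log axes
scale, scores = best_scale(xs, ys)
print(scale)
```

Here the "expected" values are the points on the fitted line for each scale, and the "actual" values are the transformed data, so each scale gets its own score.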

        There are various options to measure the "goodness of fit"

        The problem is that to do a goodness of fit calculation, you need two sets of data: the actual & expected.

        The only two sets that make any sense (to me at least) are the pre-scaled and post-scaled sets; but the correlation between those will (should) be perfect whichever scaling method is used, since the latter is derived mathematically from the former.

        I can't see what other 'expected' values you could use?

