There are various ways to measure "goodness of fit", and which criterion is best is at times highly debated. You could, for example, look at the R^2 of a linear fit under each candidate scale and choose the scale with the largest value.

Re^4: Data range detection?
by BrowserUk (Patriarch) on Apr 13, 2015 at 17:35 UTC
    There are various options to measure the "goodness of fit"

    The problem is that to do a goodness of fit calculation, you need two sets of data: the actual & expected.

    The only two sets that make any sense (to me at least) are the pre-scaled and post-scaled sets; but the correlation between those will (should) be perfect whichever scaling method is used, since the latter is derived mathematically from the former.

    I can't see what other 'expected' values you could use?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
    In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

      The "expected" data set is the linear fit. The following script uses linear regression and the R^2 metric to calculate a measure of fit for your three datasets:

      use strict;
      use warnings;
      use Statistics::LineFit;

      sub fit {
          my $fit = Statistics::LineFit->new();
          $fit->setData( @_ );
          return $fit->rSquared();
      }

      my @data = (
          [ qw( 5 5 34 44 114 169 177 184 270 339 361 364 442 511 530
                554 555 587 709 709 735 778 791 859 871 899 903 926 933 952 ) ],
          [ 0.5, 1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, 4095, 8191,
            16383, 32767, 65535, 131071, 262143, 524287, 1048575, 2097151,
            4194303, 8388607, 16777215, 33554431, 67108863, 134217727,
            268435455, 536870911, 1073741823 ],
          [ 1.713125e-005, 1.748086e-006, 2.101463e-006, 1.977405e-006,
            3.597675e-006, 3.725492e-006, 3.924736e-006, 2.902199e-006,
            3.988645e-006, 8.210367e-006, 3.360837e-006, 5.202907e-006,
            7.082570e-006, 8.778026e-006, 7.079562e-005, 9.100576e-005,
            5.258545e-005, 9.292677e-005, 1.789815e-004, 2.113948e-003,
            7.229146e-004, 1.428995e-003, 2.742045e-003, 5.552746e-003,
            1.822390e-002, 2.220999e-002, 4.316067e-002, 8.876963e-002,
            1.751072e-001, 3.494051e-001, 7.155960e-001, 1.347822e+000 ]
      );

      print "    linear  loglinear     loglog\n";

      for my $d (@data) {
          my @x    = 1..@$d;
          my @logx = map log, @x;
          my @logd = map log, @$d;
          printf "%10.2f %10.2f %10.2f\n",
              fit( \@x, $d ), fit( \@x, \@logd ), fit( \@logx, \@logd );
      }

      The result is

          linear  loglinear     loglog
            0.99       0.69       0.95
            0.26       1.00       0.86
            0.26       0.90       0.58

      which shows that the first data set is best described by a linear relationship, while the other two are better described by a log-linear one (the largest R^2 wins). If you have a stats package at hand (or even just Excel), you can do the same thing and visualize the results.
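      For readers without Statistics::LineFit installed, here is a minimal sketch of the same "largest R^2 wins" idea in plain Perl. The helpers `r_squared` and `best_scale` are hypothetical names introduced for illustration, not part of any module. The sketch relies on the fact that, for a straight-line least-squares fit, R^2 equals the squared Pearson correlation of x and y, and it assumes strictly positive data so that the logs are defined.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper (not from Statistics::LineFit): R^2 of a
# least-squares straight line through the points (x[i], y[i]).
# For simple linear regression this is the squared Pearson correlation.
sub r_squared {
    my ( $x, $y ) = @_;
    my $n = @$x;
    die "need two equally sized lists with at least two points\n"
        unless $n >= 2 && $n == @$y;

    my $mean_x = sum(@$x) / $n;
    my $mean_y = sum(@$y) / $n;

    my ( $sxx, $syy, $sxy ) = ( 0, 0, 0 );
    for my $i ( 0 .. $n - 1 ) {
        my $dx = $x->[$i] - $mean_x;
        my $dy = $y->[$i] - $mean_y;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
        $sxy += $dx * $dy;
    }

    return 0 if $sxx == 0 || $syy == 0;    # degenerate: constant x or y
    return ( $sxy * $sxy ) / ( $sxx * $syy );
}

# Hypothetical helper: fit the data against its index on three scales
# and return the label with the largest R^2, plus all three values.
sub best_scale {
    my ($d) = @_;
    my @x    = 1 .. @$d;                 # index axis 1..N
    my @logx = map { log $_ } @x;
    my @logd = map { log $_ } @$d;       # assumes strictly positive data

    my %r2 = (
        linear    => r_squared( \@x,    $d ),
        loglinear => r_squared( \@x,    \@logd ),
        loglog    => r_squared( \@logx, \@logd ),
    );
    my ($best) = sort { $r2{$b} <=> $r2{$a} } keys %r2;
    return ( $best, \%r2 );
}

# Illustrative data only: an exact exponential, so log(d) is linear in x.
my ( $best, $r2 ) = best_scale( [ map { exp( 0.5 * $_ ) } 1 .. 10 ] );
printf "%-10s %.3f\n", $_, $r2->{$_} for sort keys %$r2;
print "best: $best\n";
```

      With the exponential sample data the log-linear fit is exact up to rounding, so `best_scale` reports `loglinear`. Whether R^2 is the right criterion for a given data set is a separate question, as noted earlier in the thread.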

        Sorry, but unless my eyes are deceiving me (quite possible), you don't appear to be fitting the data at all:

        my @x = 1..@$d;            ### Takes the values 1..30, 1..31, and 1..32
        my @logx = map log, @x;    ### is the logs of those sequential ranges
        my @logd = map log, @$d;   ### the loglogs of those sequential ranges.
        printf "%10.2f %10.2f %10.2f\n", fit( \@x, $d), fit( \@x, \@logd), fit( \@logx, \@logd );

        The actual data is never passed to the fit sub?
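        A small sketch of what those loop variables hold may help settle the question. It uses shortened illustrative data and a hypothetical helper named `axes_for`; neither comes from the script above. The point it illustrates is that `$d` is itself the reference to the data values, `1 .. @$d` only supplies the index axis, and `map log, @$d` takes the logs of the data values rather than of the index.

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild the per-dataset variables from the loop.
sub axes_for {
    my ($d) = @_;
    my @x    = 1 .. @$d;               # @$d in scalar context is the element count
    my @logd = map { log $_ } @$d;     # logs of the data values themselves
    return ( \@x, \@logd );
}

# Shortened illustrative data, not the full sets from the thread.
my @data = ( [ 5, 5, 34, 44 ], [ 0.5, 1, 3 ] );

for my $d (@data) {
    my ( $x, $logd ) = axes_for($d);
    printf "x = (%s)  d = (%s)  first log = %.3f\n",
        join( ',', @$x ), join( ',', @$d ), $logd->[0];
}
```

        So in a call such as `fit( \@x, $d )` the second argument is the actual data, and `\@logd` is its log-transformed counterpart.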



        Sorry hdb, it seems it was more than just my eyes giving me trouble last night. And given it was you, I should have known better.

