Hi Monks, I need to run statistics on a large body of time series data, tracking different types of widgets purchased over time. I want to identify widget types that are becoming more popular.

I know a little statistics, and I think what I want, at least for starters, is the "constant, slope, and error" regression terms for each of my various distributions.

In other words, snipping from code below:

# want $constant, $slope, and $error coefficients for the regression
# equation fitting this data, where the distribution line is approximated by
# Y = $constant + $slope * x + $error
# Y = Dependent Variable (eg, widgets purchased at point in time)
# $constant = Y-axis Intercept
# $slope = Slope of the regression line
# x is Independent Variable (eg, time)
# $error = error factor, should be large for random distributions,
# small for strongly correlated distributions
# See http://www.tufts.edu/~gdallal/slr.htm
# dummy for now -- what's the best way to do this?

The error factor tells me which distributions I can throw out. (Error factor will be large for random distributions.)
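
One wrinkle with filtering on a raw error term: residual error is in the same units as Y, so a series that is simply twice as large has twice the error. A scale-free alternative is the Pearson correlation coefficient r (or r squared), which is near +1 or -1 for strongly linear series and near 0 for random ones. A minimal sketch in plain Perl; `pearson_r` and the sample data are made up for illustration and do not come from any module:

```perl
use strict;
use warnings;

# Pearson correlation coefficient r between two equal-length lists.
# Hypothetical helper for illustration only.
sub pearson_r {
    my ($xs, $ys) = @_;
    my $n = @$xs;
    die "need two equal-length lists with at least 2 points"
        unless $n >= 2 && $n == @$ys;

    # Means of x and y.
    my ($mean_x, $mean_y) = (0, 0);
    $mean_x += $_ / $n for @$xs;
    $mean_y += $_ / $n for @$ys;

    # Sums of squared deviations and of cross products.
    my ($sxx, $syy, $sxy) = (0, 0, 0);
    for my $i (0 .. $n - 1) {
        my $dx = $xs->[$i] - $mean_x;
        my $dy = $ys->[$i] - $mean_y;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
        $sxy += $dx * $dy;
    }
    die "zero variance in x or y" if $sxx == 0 || $syy == 0;

    return $sxy / sqrt($sxx * $syy);
}

my @time  = (1 .. 5);
my @sales = (2, 4, 5, 4, 6);    # made-up illustration data
my $r = pearson_r(\@time, \@sales);
printf "r = %.3f, r^2 = %.3f\n", $r, $r * $r;
```

Because r does not change when Y is rescaled, a series and its doubled copy get the same r, which may or may not be what you want when deciding what to throw out.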

The other two factors will tell me how popular the widget is in comparison with other widgets, and how quickly it is increasing (or decreasing) in popularity.

I wrote a little test script with distributions for "random", "increasing slowly", and "increasing quickly." (The tests fail for now, but they make concrete what I want.)

Current output is:

$ perl trend.t
slow_increase distribution, constant 0, slope 0, error 0
random distribution, constant 0, slope 0, error 0
fast_increase distribution, constant 0, slope 0, error 0
not ok 1 - $slow_increase_error < $random_error
#   Failed test '$slow_increase_error < $random_error'
#   in trend.t at line 96.
not ok 2 - $fast_increase_error < $random_error
#   Failed test '$fast_increase_error < $random_error'
#   in trend.t at line 97.
not ok 3 - $slow_increase_slope < $fast_increase_slope
#   Failed test '$slow_increase_slope < $fast_increase_slope'
#   in trend.t at line 102.
1..3
# Looks like you failed 3 tests of 3.
$

The bit that I need help with is sub calculate_regression_coefficients, which is just dummy code right now.

Now, this is as much a question about statistics as about Perl. With statistics, as with Perl, there's more than one way to do it: in this case, more than one method for fitting a line to a distribution. I just want the simplest, most vanilla, least computationally intensive way to do this, whatever that is.

There are a lot of statistics modules on the CPAN, and I assume there's something out there that covers what I need. Can someone point me in the right direction?

Thanks in advance!

#!/usr/bin/perl
use strict;
use warnings;
use Test::More qw(no_plan);

my $distributions = { random =>
              { distribution => {
                     1 => 3,
                     2 => 5,
                     3 => 2,
                     4 => 7,
                     5 => 1,
                     6 => 3,
                     7 => 2,
                     8 => 6,
                     9 => 1,
                     10 => 1,
                     11 => 3,
                     12 => 5,
                     13 => 6,
                     14 => 2,
                     15 => 8,
                     16 => 9,
                     17 => 1,
                     18 => 4,
                     19 => 5,
                     20 => 6
                    }
              },
              slow_increase =>
              { distribution => {
                     1 => 1,
                     2 => 1,
                     3 => 3,
                     4 => 2,
                     5 => 3,
                     6 => 2,
                     7 => 3,
                     8 => 4,
                     9 => 3,
                     10 => 2,
                     11 => 5,
                     12 => 4,
                     13 => 6,
                     14 => 5,
                     15 => 7,
                     16 => 4,
                     17 => 8,
                     18 => 6,
                     19 => 9,
                     20 => 8
                    }
              },
              fast_increase =>
              { distribution => {
                     1 => 2,
                     2 => 2,
                     3 => 6,
                     4 => 4,
                     5 => 6,
                     6 => 4,
                     7 => 6,
                     8 => 8,
                     9 => 6,
                     10 => 4,
                     11 => 10,
                     12 => 8,
                     13 => 12,
                     14 => 10,
                     15 => 14,
                     16 => 8,
                     17 => 16,
                     18 => 12,
                     19 => 18,
                     20 => 16
                    } }
            };

for  my $distribution_name ( keys %$distributions ) {
  my $distribution = $distributions->{$distribution_name};

  my $regression_coefficients = calculate_regression_coefficients($distribution);
  my ($constant, $slope, $error) = map { $regression_coefficients->{$_} } qw(constant slope error);

  print "$distribution_name distribution, constant $constant, slope $slope, error $error\n";

  $distributions->{$distribution_name}->{constant}=$constant;
  $distributions->{$distribution_name}->{slope}   =$slope;
  $distributions->{$distribution_name}->{error}   =$error;
}

# error of random distribution should be greater than either of the other two distributions
my $random_error = $distributions->{random}->{error};
my $slow_increase_error = $distributions->{slow_increase}->{error};
my $fast_increase_error = $distributions->{fast_increase}->{error};
ok( $slow_increase_error < $random_error, '$slow_increase_error < $random_error' );
ok( $fast_increase_error < $random_error, '$fast_increase_error < $random_error' );

#fast increase slope should be greater than slow increase slope
my $slow_increase_slope = $distributions->{slow_increase}->{slope};
my $fast_increase_slope = $distributions->{fast_increase}->{slope};
ok( $slow_increase_slope < $fast_increase_slope, '$slow_increase_slope < $fast_increase_slope' );

# want $constant, $slope, and $error coefficients for the regression
# equation fitting this data, where the distribution line is approximated by
# Y = $constant + $slope * x + $error
# Y = Dependent Variable (eg, widgets purchased at point in time)
# $constant = Y-axis Intercept
# $slope = Slope of the regression line
# x is Independent Variable (eg, time)
# $error = error factor, should be large for random distributions,
# small for strongly correlated distributions
# See http://www.tufts.edu/~gdallal/slr.htm
# dummy for now -- what's the best way to do this?
sub calculate_regression_coefficients {
  my $distribution = shift or die "no distribution";
  {constant => 0, slope => 0, error => 0}
}
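
For what it's worth, the most vanilla approach is probably ordinary least squares, which needs only a few sums over the data. A minimal sketch that could stand in for the dummy sub, under two assumptions that the post does not pin down: that "error" means the residual standard error sqrt(SSE / (n - 2)), and that the input has the same { distribution => { x => y } } shape as the script above:

```perl
use strict;
use warnings;

# Ordinary least-squares fit of y = constant + slope * x.
# ASSUMPTION: "error" is taken to be the residual standard error,
# sqrt( SSE / (n - 2) ); other definitions are possible.
sub calculate_regression_coefficients {
    my $distribution = shift or die "no distribution";
    my $points = $distribution->{distribution};
    my @xs = sort { $a <=> $b } keys %$points;
    my $n  = @xs;
    die "need at least 3 points" if $n < 3;

    # Means of x and y.
    my ($sum_x, $sum_y) = (0, 0);
    for my $x (@xs) {
        $sum_x += $x;
        $sum_y += $points->{$x};
    }
    my ($mean_x, $mean_y) = ($sum_x / $n, $sum_y / $n);

    # Sum of squared x deviations and sum of cross products.
    my ($sxx, $sxy) = (0, 0);
    for my $x (@xs) {
        my $dx = $x - $mean_x;
        $sxx += $dx * $dx;
        $sxy += $dx * ($points->{$x} - $mean_y);
    }
    die "all x values are identical" if $sxx == 0;

    my $slope    = $sxy / $sxx;
    my $constant = $mean_y - $slope * $mean_x;

    # Sum of squared residuals around the fitted line.
    my $sse = 0;
    for my $x (@xs) {
        my $residual = $points->{$x} - ($constant + $slope * $x);
        $sse += $residual * $residual;
    }
    my $error = sqrt($sse / ($n - 2));

    return { constant => $constant, slope => $slope, error => $error };
}

# Tiny check on data that lies exactly on y = 1 + 2x,
# so the fit should give constant 1, slope 2, error 0.
my $fit = calculate_regression_coefficients(
    { distribution => { 1 => 3, 2 => 5, 3 => 7, 4 => 9 } }
);
printf "constant %.3f, slope %.3f, error %.3f\n",
    @{$fit}{qw(constant slope error)};
```

Working the three test series by hand with this definition, the random series comes out with the largest error, so the three tests should pass. fast_increase only just clears the bar, though, because its residuals are exactly double those of slow_increase.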

In reply to Calculating your basic $constant, $slope, and $error terms for a time series distribution by tphyahoo