
Calculating your basic $constant, $slope, and $error terms for a time series distribution

by tphyahoo (Vicar)
on Jul 18, 2006 at 14:07 UTC

tphyahoo has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I need to run statistics on a large body of time series data, tracking different types of widgets purchased over time. I want to identify widget types that are becoming more popular.

I know a little statistics, and I think what I want, at least for starters, is the "constant, slope, and error" regression coefficients for my various distributions.

In other words, snipping from code below:

# want $constant, $slope, and $error coefficients for regression equation
# fitting this data, where the distribution line is approximated by
# Y = $constant + $slope * x + $error
# Y = Dependent Variable (eg, widgets purchased at point in time)
# $constant = Y-axis Intercept
# $slope = Slope of the regression line
# x is Independent Variable (eg, time)
# $error = error factor, should be large for random distributions, small for
# strongly correlated distributions
# See http://www.tufts.edu/~gdallal/slr.htm
#dummy for now -- what's the best way to do this?

The error factor tells me which distributions I can throw out. (Error factor will be large for random distributions.)

The other two factors will tell me how popular the widget is in comparison with other widgets, and how quickly it is increasing (or decreasing) in popularity.

I did a little test script with distributions for "random", "increasing slowly", and "increasing quickly." (Tests fail, but concretize what I want.)

Current output is:

$ perl trend.t
slow_increase distribution, constant 0, slope 0, error 0
random distribution, constant 0, slope 0, error 0
fast_increase distribution, constant 0, slope 0, error 0
not ok 1 - $slow_increase_error < $random_error
#   Failed test '$slow_increase_error < $random_error'
#   in trend.t at line 96.
not ok 2 - $fast_increase_error < $random_error
#   Failed test '$fast_increase_error < $random_error'
#   in trend.t at line 97.
not ok 3 - $slow_increase_slope < $fast_increase_slope
#   Failed test '$slow_increase_slope < $fast_increase_slope'
#   in trend.t at line 102.
1..3
# Looks like you failed 3 tests of 3.
$

The bit that I need help with is sub calculate_regression_coefficients, which is just dummy code right now.

Now, this is in a way a question about statistics as well as about Perl. With statistics, as with Perl, there's more than one way to do it: in this case, more than one method for fitting regression coefficients to a distribution. I just want the simplest, most vanilla, least computationally intensive way to do this, whatever that is.

There are a lot of statistics modules on the CPAN, and I assume there's something out there that covers what I need. Can someone point me in the right direction?

Thanks in advance!

#!/usr/bin/perl
use strict;
use warnings;
use Test::More qw(no_plan);

my $distributions = {
    random => {
        distribution => {
            1  => 3, 2  => 5, 3  => 2, 4  => 7, 5  => 1,
            6  => 3, 7  => 2, 8  => 6, 9  => 1, 10 => 1,
            11 => 3, 12 => 5, 13 => 6, 14 => 2, 15 => 8,
            16 => 9, 17 => 1, 18 => 4, 19 => 5, 20 => 6
        }
    },
    slow_increase => {
        distribution => {
            1  => 1, 2  => 1, 3  => 3, 4  => 2, 5  => 3,
            6  => 2, 7  => 3, 8  => 4, 9  => 3, 10 => 2,
            11 => 5, 12 => 4, 13 => 6, 14 => 5, 15 => 7,
            16 => 4, 17 => 8, 18 => 6, 19 => 9, 20 => 8
        }
    },
    fast_increase => {
        distribution => {
            1  => 2,  2  => 2,  3  => 6,  4  => 4,  5  => 6,
            6  => 4,  7  => 6,  8  => 8,  9  => 6,  10 => 4,
            11 => 10, 12 => 8,  13 => 12, 14 => 10, 15 => 14,
            16 => 8,  17 => 16, 18 => 12, 19 => 18, 20 => 16
        }
    }
};

for my $distribution_name ( keys %$distributions ) {
    my $distribution = $distributions->{$distribution_name};
    my $regression_coefficients = calculate_regression_coefficients($distribution);
    my ($constant, $slope, $error) = map { $regression_coefficients->{$_} } qw(constant slope error);
    print "$distribution_name distribution, constant $constant, slope $slope, error $error\n";
    $distributions->{$distribution_name}->{constant} = $constant;
    $distributions->{$distribution_name}->{slope}    = $slope;
    $distributions->{$distribution_name}->{error}    = $error;
}

# error of random distribution should be greater than either of the other two distributions
my $random_error        = $distributions->{random}->{error};
my $slow_increase_error = $distributions->{slow_increase}->{error};
my $fast_increase_error = $distributions->{fast_increase}->{error};
ok( $slow_increase_error < $random_error, '$slow_increase_error < $random_error' );
ok( $fast_increase_error < $random_error, '$fast_increase_error < $random_error' );

#fast increase slope should be greater than slow increase slope
my $slow_increase_slope = $distributions->{slow_increase}->{slope};
my $fast_increase_slope = $distributions->{fast_increase}->{slope};
ok( $slow_increase_slope < $fast_increase_slope, '$slow_increase_slope < $fast_increase_slope' );

# want $constant, $slope, and $error coefficients for regression equation
# fitting this data, where the distribution line is approximated by
# Y = $constant + $slope * x + $error
# Y = Dependent Variable (eg, widgets purchased at point in time)
# $constant = Y-axis Intercept
# $slope = Slope of the regression line
# x is Independent Variable (eg, time)
# $error = error factor, should be large for random distributions, small for
# strongly correlated distributions
# See http://www.tufts.edu/~gdallal/slr.htm
#dummy for now -- what's the best way to do this?
sub calculate_regression_coefficients {
    my $distribution = shift or die "no distribution";
    {constant => 0, slope => 0, error => 0}
}

Replies are listed 'Best First'.
Re: Calculating your basic $constant, $slope, and $error terms for a time series distribution
by bobf (Monsignor) on Jul 18, 2006 at 14:57 UTC

    It sounds like Statistics::LSNoHistory might be what you're looking for. It will calculate a least-squares linear regression for a given set of input data and return the slope (m) and intercept (k) for a line of the form y = m*x + k. It will also calculate Pearson's r correlation coefficient, which is simply a measure of how well the line fits your data. If by "error" you mean variance, you can calculate that easily after the equation of the line is known (simply sum the squares of the differences between each data point's y-value and the value predicted by the equation, then divide by the number of data points), or use the variance_x and variance_y methods.
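
For reference, the quantities described in this reply can be written out with the usual textbook least-squares expressions. This is only a sketch: the shorthand S_xx, S_yy, S_xy is introduced here for illustration, and the last line assumes that "error" means the mean squared residual (the sum of squared differences divided by the number of data points), which is one reading of the description above.

```latex
\begin{aligned}
S_{xx} &= \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad
S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \\
m &= \frac{S_{xy}}{S_{xx}}, \qquad k = \bar{y} - m\,\bar{x} \\
r &= \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \\
\text{error} &= \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (m\,x_i + k)\bigr)^2
\end{aligned}
```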

    Update: here is some code that uses your example data:

    use Statistics::LSNoHistory;

    # $distributions is the hash ref defined in the question's script
    while( my ( $distrib_type, $href ) = each %{ $distributions } ) {
        # flatten the { x => y } hash into the (x, y, x, y, ...) list the module expects
        my $regobj = Statistics::LSNoHistory->new(
            points => [ %{ $href->{distribution} } ] );

        print "\nData for $distrib_type:\n";
        printf("Slope: %.2f\n", $regobj->slope);
        printf("Intercept %.2f\n", $regobj->intercept);
        printf("Correlation Coefficient: %.2f\n", $regobj->pearson_r);
        printf("Variance (y): %.2f\n", $regobj->variance_y);
    }

    Output:

    Data for fast_increase:
    Slope: 0.72
    Intercept 1.05
    Correlation Coefficient: 0.89
    Variance (y): 22.78

    Data for slow_increase:
    Slope: 0.36
    Intercept 0.53
    Correlation Coefficient: 0.89
    Variance (y): 5.69

    Data for random:
    Slope: 0.12
    Intercept 2.77
    Correlation Coefficient: 0.28
    Variance (y): 6.11
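
The slope and intercept figures can also be cross-checked without any module. The following is a minimal sketch of the closed-form least-squares fit in plain Perl, run on the slow_increase series from the question. The sub name fit_line is hypothetical, and the "error" it returns is assumed to be the mean squared residual (sum of squared differences between observed and predicted y, divided by the number of points); other definitions of the error term are possible.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Closed-form simple least-squares fit of y = constant + slope * x.
# Takes a hash ref of { x => y } points (the shape used in the question).
# fit_line is a hypothetical name; 'error' is ASSUMED to mean the mean
# squared residual, which is small for tight fits and large for noisy data.
sub fit_line {
    my ($points) = @_;
    my @x = sort { $a <=> $b } keys %$points;
    my $n = @x;
    die "need at least two points" if $n < 2;

    my ( $sum_x, $sum_y ) = ( 0, 0 );
    for my $x (@x) {
        $sum_x += $x;
        $sum_y += $points->{$x};
    }
    my ( $mean_x, $mean_y ) = ( $sum_x / $n, $sum_y / $n );

    # Sums of squared deviations and cross-products around the means.
    my ( $sxx, $sxy ) = ( 0, 0 );
    for my $x (@x) {
        my $dx = $x - $mean_x;
        $sxx += $dx * $dx;
        $sxy += $dx * ( $points->{$x} - $mean_y );
    }
    die "all x values are identical" if $sxx == 0;

    my $slope    = $sxy / $sxx;
    my $constant = $mean_y - $slope * $mean_x;

    # Sum of squared differences between observed and predicted y.
    my $sse = 0;
    for my $x (@x) {
        my $residual = $points->{$x} - ( $constant + $slope * $x );
        $sse += $residual * $residual;
    }

    return { constant => $constant, slope => $slope, error => $sse / $n };
}

# The slow_increase series from the question.
my %slow_increase = (
    1  => 1, 2  => 1, 3  => 3, 4  => 2, 5  => 3,
    6  => 2, 7  => 3, 8  => 4, 9  => 3, 10 => 2,
    11 => 5, 12 => 4, 13 => 6, 14 => 5, 15 => 7,
    16 => 4, 17 => 8, 18 => 6, 19 => 9, 20 => 8,
);

my $fit = fit_line( \%slow_increase );
printf "constant %.2f, slope %.2f, error %.2f\n",
    @{$fit}{qw(constant slope error)};
# prints: constant 0.53, slope 0.36, error 1.12
```

The returned hash uses the same constant, slope, and error keys that the question's test script reads, so a sub like this could stand in for the dummy calculate_regression_coefficients once it is passed the inner distribution hash.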

      That was extremely helpful!
Re: Calculating the regression equation for a distribution
by explorer (Chaplain) on Jul 18, 2006 at 14:29 UTC
      I looked at this module, but I don't understand the nomenclature it uses.

      My limited background in statistics means that I get the overall methodology, more or less, but am unfamiliar with the jargon.

      For example, what is rsq?

      And what is the array of theta coefficients?

      If I can use this module to get the coefficients I want -- just a constant, a slope, and an error term -- great, but could someone show me how?

        I would be a bit less generous: the documentation is poor. Abbreviations and symbols should always be explained. Take a look at just about any paper in a refereed journal.

        rsq is probably the square of the correlation coefficient. The θ array is probably the coefficients for xi. (Note that the i is supposed to be a subscript, but it's rendered, at least in my browser, as a superscript.)

        emc

        e^(π√−1) = −1
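
If rsq does mean the squared correlation coefficient, as guessed above, it can be computed directly from the data. The sketch below is an illustration under that assumption, using two of the series from the question; pearson_r is a hypothetical helper written for this example, not a function from any module mentioned in the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Pearson's correlation coefficient r for two equal-length lists.
# pearson_r is a hypothetical helper, not part of any CPAN module.
sub pearson_r {
    my ( $xs, $ys ) = @_;
    my $n = @$xs;
    die "need two equal-length lists of at least two values"
        unless $n > 1 && $n == @$ys;

    my ( $mean_x, $mean_y ) = ( 0, 0 );
    $mean_x += $_ / $n for @$xs;
    $mean_y += $_ / $n for @$ys;

    # Sums of squared deviations and cross-products around the means.
    my ( $sxx, $syy, $sxy ) = ( 0, 0, 0 );
    for my $i ( 0 .. $n - 1 ) {
        my $dx = $xs->[$i] - $mean_x;
        my $dy = $ys->[$i] - $mean_y;
        $sxx += $dx * $dx;
        $syy += $dy * $dy;
        $sxy += $dx * $dy;
    }
    die "r is undefined when either list is constant"
        if $sxx == 0 || $syy == 0;

    return $sxy / sqrt( $sxx * $syy );
}

# Time points and two of the series from the question.
my @time          = ( 1 .. 20 );
my @slow_increase = ( 1, 1, 3, 2, 3, 2, 3, 4, 3, 2, 5, 4, 6, 5, 7, 4, 8, 6, 9, 8 );
my @random        = ( 3, 5, 2, 7, 1, 3, 2, 6, 1, 1, 3, 5, 6, 2, 8, 9, 1, 4, 5, 6 );

my $r_slow   = pearson_r( \@time, \@slow_increase );
my $r_random = pearson_r( \@time, \@random );

# Under the assumption above, rsq is simply r squared.
printf "slow_increase: r %.2f, rsq %.2f\n", $r_slow,   $r_slow**2;
printf "random:        r %.2f, rsq %.2f\n", $r_random, $r_random**2;
# prints:
# slow_increase: r 0.89, rsq 0.79
# random:        r 0.28, rsq 0.08
```

Read this way, rsq near 1 means the fitted line accounts for most of the variation in the series, and rsq near 0 means it accounts for very little.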
