OK, here's your analysis (w/ picture!)

"Use the right tool for the job," certainly applies here. Therefore, my Perl program is going to fire up R from The R Project for Statistical Computing, feed R our data, and run a regression analysis on it.

Here's the code. (Note: I'm truncating the data for conciseness.)

#!/usr/bin/perl -wl

use strict;
use File::Temp qw( tempfile );

my %rep_stats=(1=>24694, 2=>23551, 0=>22855, ... -41=>1, );

# only keep 0 < XP < 100 because we want the more mainstream
# values and not the far-out ones

my @xp_sorted = grep { $_>0 && $_<100 } sort { $a <=> $b }
                keys %rep_stats;

# generate our R commands

(my $r_commands = <<EOF) =~ s/^    //mg;
    xp <- scan()
    @xp_sorted

    count <- scan()
    @{[ @rep_stats{@xp_sorted} ]}

    summary(lm(log10(count) ~ xp + I(xp^2)))
EOF

# now, we stuff our R commands into a tempfile,
# which we'll use as STDIN

my $tmp = tempfile()
    or die "can't open tempfile: $!";
print $tmp $r_commands;
seek $tmp, 0, 0 or die "can't seek to BOF: $!";
open STDIN, ">&", $tmp or die "can't dup tmp->STDIN: $!";

# finally, we exec R, which will read our commands
# from STDIN (the temp file will be deleted automatically
# when the program exits)

my @cmd = qw(R --no-save --no-init-file --no-restore-data --slave);
exec @cmd;
die "couldn't exec @cmd : $1";  # should never get here
[download]

Now, let's run the above program and see the output:

Read 99 items
Read 99 items

Call:
lm(formula = log10(count) ~ xp + I(xp^2))

Residuals:
      Min        1Q    Median        3Q       Max 
-0.111459 -0.032358  0.004959  0.024387  0.109470 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.442e+00  1.239e-02  358.49   <2e-16 ***
xp          -4.194e-02  5.719e-04  -73.33   <2e-16 ***
I(xp^2)      1.467e-04  5.541e-06   26.48   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 0.04027 on 96 degrees of freedom
Multiple R-Squared: 0.9975,    Adjusted R-squared: 0.9974 
F-statistic: 1.889e+04 on 2 and 96 DF,  p-value: < 2.2e-16
[download]

Woohoo! It looks like we have a good fit. Converting our fitted model into a Perl function that estimates the count of nodes with a given XP, we get the following:

sub estimate_count_from_xp($) {
    my $xp = shift;
    10 ** ( 4.442 - 4.194e-2 * $xp + 1.467e-4 * $xp**2 );
}
[download]

(Because I fitted the model against log10(count), we had to exponentiate the resulting formula to get an estimation function for count.)

Just to see how good our model is, take a look at this plot comparing the actual values (dots) versus the estimated values (line). That's pretty much "on the money."

Cheers,
Tom

Tom Moertel : Blog / Talks / LectroTest / PXSL / Coffee / Movie Rating Decoder

Comment on OK, here's your analysis (w/ picture!) Select or Download Code

Replies are listed 'Best First'.
Re: OK, here's your analysis (w/ picture!) by zby (Vicar) on Sep 15, 2004 at 07:08 UTC
++ for the technique. But the result seems pretty strange - the `1.467e-4 * $xp**2` is the dominant part in the exponent so it seems that the count should grow with the xp from some point.	[reply] [d/l]
Re^2: OK, here's your analysis (w/ picture!) by tmoertel (Chaplain) on Sep 15, 2004 at 07:32 UTC
You're right that the quadratic term will, eventually, dominate. However, for the range I considered (0 < XP < 100), adding that term results in a slightly better fit. But, the fit is nearly as good without it, and so for interpretive purposes (instead of get-the-best-fit purposes), dropping the quadratic term makes for a better model: Read 99 items Read 99 items Call: lm(formula = log10(count) ~ xp) Residuals: Min 1Q Median 3Q Max -0.16823 -0.10095 -0.01733 0.07757 0.25553 Coefficients: Estimate Std. Error t value Pr(>\|t\|) (Intercept) 4.194851 0.023379 179.43 <2e-16 * xp -0.027268 0.000406 -67.17 <2e-16 * --- Signif. codes: 0 `*' 0.001 `' 0.01 `' 0.05 `.' 0.1 ` ' 1 Residual standard error: 0.1154 on 97 degrees of freedom Multiple R-Squared: 0.979, Adjusted R-squared: 0.9787 F-statistic: 4512 on 1 and 97 DF, p-value: < 2.2e-16 [download] With this model, our estimating function is as follows: `sub estimate_count_from_xp($) { my $xp = shift; 10 * ( 4.195 - 0.2727 * $xp ); }` [download] From this, it's easy to see that we have classic exponential decay w.r.t. XP. Does this match your intuition? Tom Moertel : Blog / Talks / LectroTest / PXSL / Coffee / Movie Rating Decoder	[reply] [d/l] [select]
Re^3: OK, here's your analysis (w/ picture!) by zby (Vicar) on Sep 15, 2004 at 12:50 UTC
Thank's. I thought I was missing something.	[reply]