roadnottaken has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am having a mysterious problem that I can't figure out. I am using Statistics::TTest to calculate p-values for several thousand pairs of distributions of numbers. I'm using these p-values to create a volcano plot, and when I plot the p-values I observe a strange artifact where many of the points are getting the same p-value. After some sleuthing, I can re-create the phenomenon with the four pairs of numbers in the code below. When I calculate p-values for these pairs in Excel, the values are all quite distinct (different by several orders of magnitude) but using Statistics::TTest I'm getting EXACTLY the same value for each pair.

The returned p-values are very small (around 1.6e-12), so I wonder if this isn't some sort of precision issue... but I can't figure it out. If you run the code below, it will display four identical p-values (the t-test probabilities, "t_prob"), even though the true p-values range from 1.7e-19 to 2.8e-29.

I have tried doing something similar directly with Statistics::Distributions, but I ran into the same problem; that is perhaps not surprising, since I believe Statistics::TTest relies on Statistics::Distributions for these calculations. I was not able to find any other modules that perform this calculation.
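In case it helps, here is roughly what I tried with Statistics::Distributions for a single pair. This is a from-memory sketch, not my actual script: the sample data are stand-ins, and the Welch-style t statistic and the integer rounding of the degrees of freedom are my own choices, which may not match exactly what Statistics::TTest does internally.

#!/usr/bin/perl
use strict;
use warnings;
use Statistics::Distributions;

# Two small example samples (stand-ins for one of the real pairs).
my @x = (0.72, 0.93, 1.00, 1.05, 1.06, 1.10, 1.13, 1.15, 1.22, 1.32);
my @y = (-0.49, -0.36, -0.32, -0.29, -0.20, -0.20, -0.17, -0.12, -0.10, -0.06);

my ($mx, $my) = (mean(\@x), mean(\@y));
my ($vx, $vy) = (var(\@x),  var(\@y));
my ($nx, $ny) = (scalar @x, scalar @y);

# Welch's t statistic and Welch-Satterthwaite degrees of freedom,
# rounded down because tprob() expects an integer df.
my $t  = ($mx - $my) / sqrt($vx/$nx + $vy/$ny);
my $df = int( ($vx/$nx + $vy/$ny)**2
            / ( ($vx/$nx)**2/($nx-1) + ($vy/$ny)**2/($ny-1) ) );

# Upper-tail probability from Statistics::Distributions, doubled
# for a two-sided p-value.
my $p = 2 * Statistics::Distributions::tprob($df, abs($t));
print "t = $t, df = $df, p = $p\n";

sub mean { my $a = shift; my $s = 0; $s += $_ for @$a; return $s / @$a }
sub var  {
    my $a = shift;
    my $m = mean($a);
    my $s = 0;
    $s += ($_ - $m)**2 for @$a;
    return $s / (@$a - 1);
}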

I should note that the vast majority (99%) of pairs of distributions are getting correct p-values; it's just a handful that produce this strange collision on a wrong value.

Does anyone out there have any insight into what's going wrong? Help is very much appreciated, thanks!

#!/usr/bin/perl -w
use strict;
use Statistics::TTest;

my %datasets = ();

@{$datasets{a}[0]} = (
    0.722466024, 0.925999419, 1, 1.049630768, 1.056583528, 1.10433666,
    1.13093087, 1.150559677, 1.220329955, 1.316145742, 1.333423734,
    1.63691458, 1.691534165, 0.713695815, 0.815575429, 0.918386234,
    0.925999419, 0.941106311, 0.948600847, 0.970853654, 0.98550043,
    0.98550043, 1, 1.028569152, 1.042644337, 1.10433666, 1.117695043,
    1.269033146, 1.286881148, 1.298658316, 1.575312331,
);
@{$datasets{a}[1]} = (
    -0.49410907, -0.358453971, -0.321928095, -0.286304185, -0.200912694,
    -0.200912694, -0.168122759, -0.120294234, -0.120294234, -0.104697379,
    -0.104697379, -0.074000581, -0.058893689, -0.577766999, -0.514573173,
    -0.514573173, -0.49410907, -0.358453971, -0.358453971, -0.304006187,
    -0.251538767, -0.184424571,
);
@{$datasets{b}[0]} = (
    -0.434402824, -0.286304185, -0.251538767, -0.058893689, -0.043943348,
    -0.043943348, 0.084064265, 0.163498732, 0.23878686, 0.23878686,
    0.310340121, 0.839959587, 0.879705766, -0.556393349, -0.268816758,
    -0.251538767, -0.152003093, -0.104697379, -0.089267338, -0.029146346,
    -0.029146346, 0, 0.070389328, 0.084064265, 0.097610797, 0.124328135,
    0.137503524, 0.189033824, 0.189033824, 0.214124805, 0.214124805,
    0.214124805, 0.321928095, 0.333423734, 0.367371066, 0.40053793,
    0.411426246, 0.443606651, 0.516015147, 0.669026766, 0.713695815,
);
@{$datasets{b}[1]} = (
    0.782408565, 0.799087306, 0.82374936, 0.887525271, 0.925999419,
    0.933572638, 0.956056652, 0.97819563, 0.98550043, 1.021479727,
    1.084064265, 1.097610797, 1.13093087, 1.150559677, 1.15704371,
    1.176322773, 1.182692298, 1.22650853, 1.286881148, 1.292781749,
    1.310340121, 1.459431619, 1.485426827, 1.521050737, 1.59454855,
    1.695993813, 1.713695815, 1.726831217, 0.40053793, 0.411426246,
    0.59454855, 0.925999419, 0.941106311, 0.948600847, 0.98550043,
    1.028569152, 1.070389328, 1.117695043, 1.124328135, 1.220329955,
    1.316145742, 1.744161096,
);
@{$datasets{c}[0]} = (
    -0.043943348, -0.029146346, -0.01449957, 0.028569152, 0.097610797,
    0.124328135, 0.176322773, 0.201633861, 0.263034406, -0.862496476,
    -0.104697379, 0.084064265, 0.084064265, 0.084064265, 0.124328135,
    0.124328135, 0.163498732, 0.263034406, 0.275007047, 0.286881148,
    0.321928095, 0.333423734,
);
@{$datasets{c}[1]} = (
    -2.64385619, -2.556393349, -2.473931188, -2.395928676, -2.395928676,
    -2.395928676, -2.321928095, -2.321928095, -2.321928095, -2.251538767,
    -2.251538767, -2.184424571, -2.120294234, -2, -0.535331733,
    -1.64385619, -1.556393349, -1.514573173, -1.514573173, -1.473931188,
    -1.434402824, -1.434402824, -1.395928676, -1.395928676, -1.395928676,
    -1.395928676, -1.358453971, -1.358453971, -1.358453971, -1.358453971,
    -1.358453971, -1.321928095, -1.286304185, -1.286304185, -1.286304185,
    -1.251538767, -1.217591435, -1.120294234, -1,
);
@{$datasets{d}[0]} = (
    0.933572638261024, 0.948600847493356, 0.948600847493356,
    0.970853654340483, 0.978195629681652, 1.111031312388740,
    1.150559676575380, 1.416839741912830, 0.731183241572200,
    0.790772037862000, 0.815575428862573, 0.855989697308481,
    0.871843648509318, 0.895302621333307, 0.933572638261024,
    0.941106310946431, 0.948600847493356, 0.956056652412403,
    0.970853654340483, 0.992768430768924, 1.000000000000000,
    1.063502942306160, 1.226508529808680, 1.269033146455240,
    1.298658315564520, 1.704871964456350,
);
@{$datasets{d}[1]} = (
    -0.473931188332412, 0.028569152196771, 0.042644337408494,
    0.056583528366368, 0.070389327891398, 0.084064264788475,
    0.097610796626422, 0.111031312388744, 0.454175893185802,
    0.454175893185802, -0.514573172829758, -0.268816758427800,
    -0.168122758808327, -0.136061549576028, -0.043943347587597,
    0.014355292977070, 0.111031312388744, 0.124328135002202,
    0.137503523749935, 0.176322772640463, 0.238786859587116,
    0.250961573533219, 0.344828496997441,
);

foreach my $dataset (sort keys %datasets) {
    my $ttest = new Statistics::TTest;
    $ttest->load_data(\@{$datasets{$dataset}[0]}, \@{$datasets{$dataset}[1]});
    print "$dataset - t_prob:\t$ttest->{t_prob}\n\n";
}

Re: strange collision when calculating p-values using Statistics::TTest
by zentara (Cardinal) on Jun 23, 2012 at 18:35 UTC
    Aside from the possibility of a bug in the Statistics::TTest module itself, my first suspicion would be that Statistics::TTest leaves behind leftover state, possibly by using globals within the module. There is a FAQ entry for this: read perldoc -q clear, where MJD shows how to clear out all temporary package variables.

    A module's author will usually stick to an array or hash for storage inside the module. MJD's scrubber gets them all, but you may only need something like this at the end of your loop, to undef the temporary data:

    foreach my $dataset (sort keys %datasets) {
        my $ttest = new Statistics::TTest;
        $ttest->load_data(\@{$datasets{$dataset}[0]}, \@{$datasets{$dataset}[1]});
        print "$dataset - t_prob:\t$ttest->{t_prob}\n\n";
        undef @$ttest;    # clear out old array data
        undef %$ttest;    # clear out old hash data
    }

Re: strange collision when calculating p-values using Statistics::TTest
by ReturnOfThelonious (Beadle) on Jun 25, 2012 at 02:30 UTC
    The p-values are very small (around 1.6e-12) so I wonder if this isn't some sort of precision issue.
    It appears to be a limitation of the tprob function in Statistics::Distributions. The approximation isn't good enough for such small values.
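    One quick way to see this for yourself (a rough probe I would try, not something from the module's own tests): hold the degrees of freedom fixed and feed tprob steadily larger t statistics. If the approximation runs out of precision, the printed probabilities stop shrinking and flatten out at a floor value instead of continuing to fall.

    use strict;
    use warnings;
    use Statistics::Distributions;

    # Probe tprob() with a fixed df and increasingly extreme t values.
    # A flattening of the output is the precision floor in action.
    my $df = 40;    # arbitrary, roughly the size of the OP's samples
    for my $t (5, 8, 10, 12, 15, 20) {
        my $p = Statistics::Distributions::tprob($df, $t);
        printf "t = %4.1f  =>  upper-tail p = %g\n", $t, $p;
    }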
      Thanks for the comments. In case anyone else runs into a similar problem, I have found a workaround. I still have not figured out what is going on with Statistics::TTest or Statistics::Distributions, but I have found another module that works correctly with these data. As I understand it, a one-way ANOVA on a single pair of distributions is equivalent to a Student's t-test, so I tried Statistics::ANOVA and had success. Using the definitions of %datasets from the code above, the following loop calculates correct p-values (matching the values that Excel gives):
      foreach my $dataset (sort keys %datasets) {
          my $aov = Statistics::ANOVA->new();
          $aov->load( "$dataset\1", \@{$datasets{$dataset}[0]} );
          $aov->add(  "$dataset\2", \@{$datasets{$dataset}[1]} );
          my $str = $aov->anova(independent => 1, parametric => 1, ordinal => 0);
          print $str->{_stat}->{p_value} . "\n";
      }
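      For anyone who would rather keep the t-test itself instead of switching to ANOVA, another route that might be worth testing is to let Statistics::TTest compute the t statistic and degrees of freedom, then get the tail probability from Math::CDF's pt(). This is only a sketch: I have not checked Math::CDF's accuracy in these extreme tails myself, and the t_statistic/df accessor names are my assumption rather than something I used in the code above, so check them against the POD of your installed version.

      use Math::CDF qw(pt);

      # Reuses %datasets from the original post. Statistics::TTest still
      # does the t-test bookkeeping; only the tail probability comes from
      # Math::CDF instead of Statistics::Distributions::tprob().
      foreach my $dataset (sort keys %datasets) {
          my $ttest = new Statistics::TTest;
          $ttest->load_data(\@{$datasets{$dataset}[0]}, \@{$datasets{$dataset}[1]});
          my $t  = $ttest->t_statistic;    # accessor names assumed; verify in the POD
          my $df = $ttest->df;
          my $p  = 2 * pt(-abs($t), $df);  # two-sided p from the lower tail
          print "$dataset - p: $p\n";
      }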