Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Fisher's Exact Test

by tlm (Prior)
on Aug 11, 2005 at 12:19 UTC ( [id://482919]=sourcecode: print w/replies, xml ) Need Help??
Category: Miscellaneous
Author/Contact Info /msg me
Description:

See documentation with the code.

Note that the code computes factorials in only one place, the internal subroutine _single_term, which computes the hypergeometric probability function for a 2 x 2 contingency table (specified through its cell values, a, b, c, and d). Furthermore, the only dependance of the main routine (fishers_exact) on the Math::Pari module is through _single_term. If you prefer a library other than Math::Pari or don't like my implementation of _single_term, just replace it with your own. The rest of the code should work fine.

Also note that _single_term is called at most twice for every call to fishers_exact, so relegating this computation to a separate function doesn't significantly add to fishers_exact's overhead.

The computation of factorials attempts to use Math::Pari::factorial, and switches over to Gosper's approximation if that fails. Although, in principle, this maximizes the algorithm's accuracy, the performance cost may not justify this policy. For example, for 10_000!, the relative difference between the two methods is < 1e-10, but Gosper's is much faster:

Rate pari gospers pari 604/s -- -94% gospers 9657/s 1500% --
The cost/performance only gets worse for larger factorials.

On the other hand, if you find yourself working in this regime, ask yourself whether χ2 isn't more suited to your needs than than the FET.

Needless to say, I am providing this code "as is", without any expressed or implied warranties whatsoever. Use at your own risk.

=head1 NAME

Statistics::FET - Fisher's Exact Test statistic (2x2 case)

=head1 SYNOPSIS

  use Statistics::FET 'fishers_exact';

=head1 DESCRIPTION

This module exports only one function, C<fishers_exact>, which
computes the one- or two-sided Fisher's Exact Test statistic for the 2
x 2 case.  In the following documentation I will be referring to the
following family of 2 x 2 contingency tables


                        *   |   *   | r1 = a+b
                    --------+-------+----------
                        *   |   *   | r2 = c+d
                    --------+-------+----------
                       c1   |   c2  |    N
                     = a+c  | = b+d | = a+b+c+d

The *'s mark the cells, N is the total number of points represented by
the table, and r1, r2, c1, c2 are the marginals.  As suggested by the
equalities, the letters a, b, c, d refer to the various cells (reading
them left-to-right, top-to-bottom).  Depending on context, the letter
c (for example) will refer *either* to the lower left cell of the
table *or* to a specific value in that cell.  Same for a, b, and d.

In what follows, H(x) (or more precisely, H(x; r1, r2, c1)) refers
to the hypergeometric expression

                           r1!*r2!*c1!*c2!
                -------------------------------------
                (r1+r2)!*x!*(r1-x)!*(c1-x)!*(r2-c1+x)!

(I omit c2 from the parametrization of H because c2 = r1 + r2 - c1.)

=head1 FUNCTION

=over 4

=item fishers_exact( $a, $b, $c, $d, $two_sided )

The paramater C<$two_sided> is optional.  If missing or false
C<fishers_exact> computes the one-sided FET p-value.  More
specifically, it computes the sum of H(x; a+b, c+d, a+c) for x = a to
x = min(a+b, a+c).  (If you want the sum of H(x; a+b, c+d, a+c) for x
= max(0, a-d) to x = a, then compute C<fishers_exact( $b, $a, $d, $c )
+> or
C<fishers_exact( $c, $d, $a, $b )> (these two are equivalent).)

If C<$two_sided> is true, the returned p-value will be for the two-sid
+ed
FET.

=back

=cut



use strict;
use warnings;

package Statistics::FET;
use Exporter 'import';

use Math::Pari ();

my $Tolerance = 1;
$Tolerance /= 2 while 1 + $Tolerance/2 > 1;

@Statistics::FET::EXPORT_OK = 'fishers_exact';
sub fishers_exact {
  my ( $a, $b, $c, $d, $ts ) = @_;
  my $test = $a*( $a + $b + $c + $d ) - ( $a + $b )*( $a + $c );

  return 1 if $test < 0 and $ts;
  # below here, $test < 0 implies !$ts;

  my $p_val;

  if ( $test < 0 ) {
    if ( $d < $a ) {
      $p_val = _fishers_exact( $d, $c, $b, $a, $ts, 1 );
    }
    else {
      $p_val = _fishers_exact( $a, $b, $c, $d, $ts, 1 );
    }
  }
  else {
    if ( $b < $c ) {
      $p_val = _fishers_exact( $b, $a, $d, $c, $ts, 0 );
    }
    else {
      $p_val = _fishers_exact( $c, $d, $a, $b, $ts, 0 );
    }
  }

  return $p_val;
}

sub _fishers_exact {
  my ( $a, $b, $c, $d, $ts, $complement ) = @_;
  die "Bad args\n" if $ts && $complement;

  my ( $aa, $bb, $cc, $dd ) = ( $a, $b, $c, $d );
  my $first = my $delta = _single_term( $aa, $bb, $cc, $dd );
  my $p_val = 0;

  {
    $p_val += $delta;
    last if $aa < 1;
    $delta *= ( ( $aa-- * $dd-- )/( ++$bb * ++$cc ) );
    redo;
  }

  if ( $ts ) {
    my $m = $b < $c ? $b : $c;
    ($aa, $bb, $cc, $dd ) = ( $a + $m, $b - $m, $c - $m, $d + $m );
    $delta = _single_term( $aa, $bb, $cc, $dd );
    my $bound = -$Tolerance;
    while ( $bound <= ( $first - $delta )/$first && $aa > $a ) {

      $p_val += $delta;
      $delta *= ( ( $aa-- * $dd-- )/( ++$bb * ++$cc ) );
    }
  }
  elsif ( $complement ) {
    $p_val = 1 - $p_val + $first;
  }

  return $p_val;
}

sub _single_term {
  my ( $a, $b, $c, $d ) = @_;
  my ( $r1, $r2 ) = ($a + $b, $c + $d);
  my ( $c1, $c2 ) = ($a + $c, $b + $d);
  my $N = $r1 + $r2;

  return  exp( _ln_fact( $r1 ) + _ln_fact( $r2 ) +
               _ln_fact( $c1 ) + _ln_fact( $c2 ) -
               _ln_fact( $N ) -
               ( _ln_fact( $a ) + _ln_fact( $b ) +
                 _ln_fact( $c ) + _ln_fact( $d ) ) );
}

{
  my $two_pi;
  my $pi_over_3;
  my $half;

  BEGIN {
    $two_pi    = Math::Pari::PARI( 2 * atan2 0, -1 );
    $pi_over_3 = Math::Pari::PARI( atan2( 0, -1 )/3 );
    $half      = Math::Pari::PARI( 0.5 );
  }

  sub _ln_fact {
    my $n = Math::Pari::PARI( shift() );
    die "Bad args to _ln_fact: $n" if $n < 0;
    my $ln_fact;
    eval {
      $ln_fact = log Math::Pari::factorial( $n );
    };
    if ( $@ ) {
      die $@ unless $@ =~ /\QPARI:   ***   exponent overflow/;
      # Gosper's approximation; from
      # http://mathworld.wolfram.com/StirlingsApproximation.html
      $ln_fact = $half * log( $two_pi*$n + $pi_over_3 )
                 +  $n * log( $n )
                 -  $n;
    }
    return $ln_fact;
  }
}

__END__
Replies are listed 'Best First'.
Re: Fisher's Exact Test
by BioLion (Curate) on Jul 15, 2009 at 10:53 UTC

    I was really desperate for a perl implemetation of FET, Text::NSP::Measures::2D::Fisher just didn't stand up in any way - the 'basic usage' they give doesn't even work, even once you have fixed their syntax errors and got it to compile!

    So obviously i end up looking in perlmonks for something better though through! So, following up from resonator's comments, i made some test code for the posted code :

    And it seemed to backup what resonator said, but tlm's code did get the answer right for *some* of the tests, so i was encouraged and made the following changes (extremely minor!). Mainly (erm... only) adding in :

    TEST: ## simplified test equation my $test = $a*$d - $b*$c; ## introduced switching around of input pairs if ($test < 0 && $ts){ ($a, $b, $c, $d, ) = ($b, $a, $d, $c,); goto TEST; }

    right after the 'test' bit that had people confused!So, now it passes all tests!

    So here is some updated code for anyone who is interested!

    Further tests|opinions are always welcome! I was surprised there was not more FET mods there, at least that i could find!

    Just a something something...
Re: Fisher's Exact Test
by resonator (Initiate) on Apr 18, 2007 at 17:41 UTC
    Hi.

    Have you checked the results of your code for Fisher's Exact Test? I don't think it produces correct results.

    Here's a utility that produces results identical to those I calculate: http://www.exactoid.com/fisher/index.php

    This expression at the beginning of your code seems odd:

    $test = $a*($a+$b+$c+$d ) - ($a+$b)*($a+$c);
    It simplifies to: $test = $a*$d - $b*$c;

    I don't follow why you have this:

    return 1 if $test < 0 and $ts;
    For example, if a = 1, b = 50, c = 10 and d = 5, P(one-sided) = P(two-sided) = 1.4386e-07, not 1.

    Nevertheless, thanks for the tip about calculating factorials only once. Let me know if I've misunderstood something.

      I think it is correct. It is justified because $ts implies one-side test, so the p-value is indeed 1. Hao
        I think it is wrong. $ts means two-side, not one-side.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: sourcecode [id://482919]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others perusing the Monastery: (4)
As of 2024-03-28 08:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found