mrguy123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks
I am writing a web based bioinformatic program that uses the hypergeometric distribution. I use the code from this perl node
#!/usr/bin/perl -w use strict; sub logfact { return gammln(shift(@_) + 1.0); } sub hypergeom { # There are m "bad" and n "good" balls in an urn. # Pick N of them. The probability of i or more successful selection +s: # (m!n!N!(m+n-N)!)/(i!(n-i)!(m+i-N)!(N-i)!(m+n)!) my ($n, $m, $N, $i) = @_; my $loghyp1 = logfact($m)+logfact($n)+logfact($N)+logfact($m+$n-$N) +; my $loghyp2 = logfact($i)+logfact($n-$i)+logfact($m+$i-$N)+logfact( +$N-$i)+logfact($m+$n); return exp($loghyp1 - $loghyp2); } sub gammln { my $xx = shift; my @cof = (76.18009172947146, -86.50532032941677, 24.01409824083091, -1.231739572450155, 0.12086509738661e-2, -0.5395239384953e-5); my $y = my $x = $xx; my $tmp = $x + 5.5; $tmp -= ($x + .5) * log($tmp); my $ser = 1.000000000190015; for my $j (0..5) { $ser += $cof[$j]/++$y; } -$tmp + log(2.5066282746310005*$ser/$x); } print hypergeom(300,700,100,40),"\n";
For small numbers it works fine. However, for large input it is too slow. For example, I have an input set that needs to call the hypergeom function 137544 times, and it takes about 70 seconds. As this is a web based tool, it is of course too slow.
My question is what other options do I have? I assumed the gammln is very fast but for large numbers it chokes.

Another option is Inline::C which is supposed to be fast but I don't know if calling a C function so often is the best option.
Math::GSL::CDF is also an option, but I'm not sure it is fast enough. Both modules need to be installed, and since I don't have permissions on the University server I would like some advice before asking for installation.

Any other ideas and suggestions would be welcome. This is a part of Perl I'm not very familiar with and would be happy to get input.
Also, I know the thread I mentioned earlier was pretty thorough, but it was also 5 years ago and I think it is worth asking this question again.
Thanks in advance
Guy Naamati

Q:What do Alexander the Great and Kermit the Frog have in common?

A: Their middle name
Update: Solution by BrowserUk brought the time down to 5 seconds. However, the output is different, and I am now trying to figure out why.
Thanks for your help,
Guy
Problem solved. Thanks again!

Replies are listed 'Best First'.
Re: Fast hypergeometric calculation in Perl
by BrowserUk (Patriarch) on Aug 24, 2010 at 09:16 UTC

    This is more than an order of magnitude faster which should get you into the realms of web-ability.

    Your code takes 0.000063938 seconds; the code below 0.000002392 seconds; which should reduce your 70 seconds to 2.6.

    I've precalculated all the logfacts for 0 .. 1000. If your inputs need a different range, recalculate:

    #!/usr/bin/perl -slw use strict; my @logfact = <DATA>; sub hypergeom { # There are m "bad" and n "good" balls in an urn. # Pick N of them. The probability of i or more successful selectio +ns: # (m!n!N!(m+n-N)!)/(i!(n-i)!(m+i-N)!(N-i)!(m+n)!) my ($n, $m, $N, $i) = @_; my $loghyp1 = $logfact[ $m ] + $logfact[ $n ] + $logfact[ $N ] + $logfact[ $m + $n - $N ]; my $loghyp2 = $logfact[ $i ] + $logfact[ $n - $i ] + $logfact[ $m + $i - $N ] + $logfact[ $N - $i ] + $logfact[ $m + $n ]; return exp($loghyp1 - $loghyp2); } print hypergeom( 300, 700, 100, 40 );

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fast hypergeometric calculation in Perl
by salva (Canon) on Aug 24, 2010 at 10:14 UTC
    Besides BrowserUk suggestion (that's probably the way to go), some simple modifications on logfact (inlining things and eliminating unnecessary temporaries) makes it twice as fast:
    sub hypergeom1 { # There are m "bad" and n "good" balls in an urn. # Pick N of them. The probability of i or more successful selectio +ns: # (m!n!N!(m+n-N)!)/(i!(n-i)!(m+i-N)!(N-i)!(m+n)!) my ($n, $m, $N, $i) = @_; my $loghyp1 = logfact1($m) +logfact1($n)+logfact1($N)+logfact1($m+ +$n-$N); my $loghyp2 = logfact1($i)+logfact1($n-$i)+logfact1($m+$i-$N)+logf +act1($N-$i)+logfact1($m+$n); return exp($loghyp1 - $loghyp2); } sub logfact1 { my $x = shift; my $ser = ( 1.000000000190015 + 76.18009172947146 / ($x + 2) - 86.50532032941677 / ($x + 3) + 24.01409824083091 / ($x + 4) - 1.231739572450155 / ($x + 5) + 0.12086509738661e-2 / ($x + 6) - 0.5395239384953e-5 / ($x + 7) ); my $tmp = $x + 6.5; ($x + 1.5) * log($tmp) - $tmp + log(2.5066282746310005 * $ser / ($ +x+1)); }
Re: Fast hypergeometric calculation in Perl
by Khen1950fx (Canon) on Aug 24, 2010 at 09:22 UTC
    I ran your script and got 0.00693150741379456. I made a few minor changes and got 0.0069140819065811. Here's what I tried:
    #!/usr/bin/perl use strict; use warnings; sub logfact { return gammln(shift(@_) + 1); } sub hypergeom { my ($n, $m, $N, $i) = @_; my $loghyp1 = logfact($m)+logfact($n)+logfact($N)+logfact($m+$n-$N) +; my $loghyp2 = logfact($i)+logfact($n-$i)+logfact($m+$i-$N)+logfact( +$N-$i)+logfact($m+$n); return exp($loghyp1 - $loghyp2); } sub gammln { my $xx = shift; my @cof = (76.180091729471457, -86.505320329416776, 24.014098240830911, -1.231739572450155, 0.12086509738661, -5.395239384953e-06); my $y = my $x = $xx; my $tmp = $x + 5.5; $tmp -= ($x + 0.5) * log($tmp); my $ser = 1.0000000001900149; foreach my $j (0 .. 5) { $ser += $cof[$j] / ++$y; } -$tmp + log(2.5066282746310007*$ser/$x); } print hypergeom(300,700,100,40),"\n";
    I think that you'll need Math::GSL for Math::GSL::Randist. One other module that I like for getting gamma and log_gamma is Math::GammaFunction.

      Minor nit: As the author of Math::GammaFunction, I would consider Math::GSL a more mature, more reliable implementation.

      --Steffen

Re: Fast hypergeometric calculation in Perl
by roboticus (Chancellor) on Aug 24, 2010 at 11:49 UTC

    mrguy123:

    Have you tried to use Memoize to cache the results of logfact? If it's repeatedly being called with the same values, you might get a speed boost. You could try to Memoize hypergeom as well--but if there are too many combinations of arguments (i.e., the argument list rarely repeats) it would be a loss.

    ...roboticus

    Update: Just for my own amusement, I added Memoize to the program and found no significant improvements. Apparently the functions are fast enough that there's no big win to be found simply by caching the values. I replaced the single call at the end of the program with a nested loop (to generate enough runtime to be significant). This should call hypergeom 203,401 times:

    for my $i (280 .. 320) { for my $j (680 .. 720) { for my $k (95 .. 105) { for my $l (30 .. 40) { my $hg = hypergeom($i, $j, $k, $l); } } } }

    The results are:

    39.11s Without Memoize 35.25s Memoize logfact only 38.40s Memoize gammaln only 47.04s Memoize hypergeom only

    Of course, with the usage pattern given, I may have artificially improved/reduced Memoize's results, so your mileage will vary.

Re: Fast hypergeometric calculation in Perl
by zentara (Cardinal) on Aug 24, 2010 at 12:30 UTC
    have an input set that needs to call the hypergeom function 137544 times, and it takes about 70 seconds. .............., it is of course too slow. My question is what other options

    PDL is very fast, as fast as C when it is number crunching. Google for "PDL hypergeometric" for ideas. The PDL maillist is very helpful.


    I'm not really a human, but I play one on earth.
    Old Perl Programmer Haiku flash japh
Re: Fast hypergeometric calculation in Perl
by JavaFan (Canon) on Aug 24, 2010 at 09:43 UTC
    I have an input set that needs to call the hypergeom function 137544 times
    How many different $m, $n, $N, $m+$n-$N, $i, $n-$i, $m+$i-$N, $N-$i, $m+$n do you have? What happens if you cache the result of logfact? And how big is your $m+$n? Can't you just precalculate k! for 1 <= k <= $m+$n and then use a lookup in the webapplication? Then your calculation just becomes 9 lookups, 7 bignum additions and 1 bignum division.
Re: Fast hypergeometric calculation in Perl
by tospo (Hermit) on Aug 24, 2010 at 14:08 UTC

    And just as an aside: if installing modules locally is a problem due to not having root access: check out local::lib, which allows you to do just that.

Re: Fast hypergeometric calculation in Perl
by pemungkah (Priest) on Aug 24, 2010 at 17:43 UTC
    Your output difference may come down to the tiny differences you're going to get if you enter values by typing them in vs. caching them after you've calculated them once. You may be losing a few bits of precision. The exp($loghyp1 - $loghyp2) expression could also be causing you to lose precision because of subtractive cancellation.