hailholyghost has asked for the wisdom of the Perl Monks concerning the following question:

I was recently tasked with sorting by Kolmogorov-Smirnov p-values. I can do this inside Perl with the Statistics::R package (http://search.cpan.org/…/Statistics-R-0…/lib/Statistics/R.pm), but that package calls R externally and is therefore extraordinarily slow. I've also found GSL packages, but these don't have the Kolmogorov-Smirnov test available. I have the source code for R's ks.test function, but, as I said, calling out to R is impractical. How can I translate this function to Perl, or to C?

Re: Translation of R functions to Perl subroutines
by choroba (Cardinal) on Oct 12, 2017 at 15:35 UTC
    The easiest way is to find a person who understands both the source and target language, and pay them for the translation. Another option, cheaper but potentially slower, is to learn the languages yourself. Without having seen the source, it's hard to tell how complex the translation would be.

Re: Translation of R functions to Perl subroutines
by Anonymous Monk on Oct 12, 2017 at 15:51 UTC
    The Kolmogorov–Smirnov test seems pretty straightforward, but the devil is in the details, as they say. If you have the cumulative distribution function (CDF) available (you didn't say which distribution you're dealing with), you just sort your data values and compare the empirical CDF against the theoretical CDF at each sorted point; a rough sketch is below.
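    A minimal sketch of that idea in plain Perl (not R's ks.test implementation): sort the sample, then take the largest gap between the empirical CDF and a user-supplied theoretical CDF. The uniform(0,1) CDF used in the example is only an illustrative assumption; substitute whichever distribution actually applies.

        use strict;
        use warnings;
        use List::Util qw(max);

        # One-sample Kolmogorov-Smirnov statistic D for the data in @$data
        # against a theoretical CDF supplied as a coderef $cdf.
        sub ks_statistic {
            my ($data, $cdf) = @_;
            my @x = sort { $a <=> $b } @$data;
            my $n = scalar @x;
            my $d = 0;
            for my $i (0 .. $n - 1) {
                my $f = $cdf->($x[$i]);
                # largest deviation above and below the empirical CDF step
                $d = max($d, ($i + 1) / $n - $f, $f - $i / $n);
            }
            return $d;
        }

        # Example: test a small sample against the uniform(0,1) CDF.
        my $uniform_cdf = sub { my $x = shift; $x < 0 ? 0 : $x > 1 ? 1 : $x };
        my @sample = (0.10, 0.25, 0.42, 0.58, 0.91);
        printf "D = %.4f\n", ks_statistic(\@sample, $uniform_cdf);

    Turning D into a p-value is the fiddlier part: for large n the usual route is the asymptotic Kolmogorov distribution, roughly 2 * sum over k >= 1 of (-1)**(k-1) * exp(-2 * k**2 * n * D**2), whereas R's ks.test switches to an exact calculation for small samples.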
Re: Translation of R functions to Perl subroutines
by Laurent_R (Canon) on Oct 12, 2017 at 21:57 UTC
    I don't know anything about Kolmogorov-Smirnov p-values, but R itself is usually fairly efficient, so if this is "extraordinarily slow," it is probably not being used correctly. Just my 2 cents.
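    If the bottleneck really is the R bridge rather than R itself, one common fix is to start R once and push every sample through a single Statistics::R session instead of spawning a new R process per test. A minimal sketch, assuming placeholder data and a standard normal reference distribution (both are illustrative assumptions):

        use strict;
        use warnings;
        use Statistics::R;

        my $R = Statistics::R->new();    # start R once and reuse the bridge

        my @samples = ([1.2, 0.7, 2.3], [0.1, 0.4, 0.9]);    # placeholder data

        my @p_values;
        for my $sample (@samples) {
            $R->set('x', $sample);                           # ship the sample to R
            $R->run('p <- ks.test(x, "pnorm")$p.value');     # one-sample KS vs N(0,1)
            push @p_values, $R->get('p');
        }

        $R->stop();

        # sort the samples by their p-values
        my @order = sort { $p_values[$a] <=> $p_values[$b] } 0 .. $#p_values;
        print "order by p-value: @order\n";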
Re: Translation of R functions to Perl subroutines
by Anonymous Monk on Oct 12, 2017 at 19:33 UTC
    Perhaps you should therefore challenge the original requirement: if these values are prohibitively slow to calculate one at a time, as you seem to have just determined, then look for an alternative approach, perhaps with the assistance of your friendly neighborhood statistician boss. It may be that computing the p-values requires re-processing the entire dataset once for each member of that dataset, which makes it an O(n^2) computation in the number of observations and therefore prohibitively expensive.