http://qs1969.pair.com?node_id=477217

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

i wrote a function to compute standard deviation because i am in the ridiculous and unenviable position of using a networked version of perl with a non functioning ppm

(else i would have used Math::StdDev)

now it turns out my function doesn't give the same results as Excel or SPSS (its not because one is the sample stddev and the other the population stddev i already checked that)

next i copied a stddev function straight out of a linux journal article on benchmarking

http://www.linuxjournal.com/article/6540

sub std_dev_ref_sum { my $ar = shift; my $elements = scalar @$ar; my $sum = 0; my $sumsq = 0; foreach (@$ar) { $sum += $_; $sumsq += ($_ **2); } return sqrt( $sumsq/$elements - (($sum/$elements) ** 2)); }

this gives the same answers as my original code (hence i don't bother posting my original code as well)

does anyone know why i might be getting different results to these commercially available applications?

20050722 Janitored by Corion: Added formatting, linkified link

Replies are listed 'Best First'.
Re: standard deviation accuracy question
by NetWallah (Canon) on Jul 22, 2005 at 13:47 UTC
      Just to get this info in PM for future lookups in case that link dies:


      1. The binomial standard deviation

      The binomial standard deviation applies to events with two outcomes: win or lose. For example, betting on heads in coin tossing can lead to win (the appearance of heads) or loss (the appearance of the opposite; tails, in this case). The binomial standard deviation is calculated by the following formula: Standard deviation = SQR{(N*p*(1-p)}

      That is, the square root of: the number of trials (events) N, multiplied by the probability p, multiplied by the opposite probability (or 1 minus p). (where p is the probability of appearance and N represents the number of trials).


      Suppose we toss a coin 100 times (N=100). The probability of heads is p=1/2=0.5. The standard deviation is SQR{100 * 0.5 * 0.5} = SQR(100 * .25) =SQR(25) = 5. The expected number of heads in 100 tosses is 0.5 * 100 = 50. The rule of normal probability proves that in 68.2% of the cases, the number of heads will fall within one standard deviation from the number of expected successes (50). That is, if we repeat 1000 times the event of tossing a coin 100 times, in 682 cases we'll encounter a number of heads between 45 and 55.


      2. The statistical standard deviation

      There is no formula to calculate the statistics standard deviation directly (?) That's what they told you in school. That's what they say in other public places with the self-proclaimed goal of education. Only an algorithm can lead to the standard deviation of a data series. Indeed, the algorithm is always available. The following are the steps of the algorithm implemented in my freeware Super Formula. Sum up data; calculate the mean average (sum total divided by the number of elements); deduct each element of the collection from the average; raise each difference to the power of 2; add up the squared differences; divide the new sum total by the number of elements in the data series; the result represents the variance; the square root of the variance represents the famous standard deviation.


      A data series like 1, 2, 3, 6 has a mean average (mu) equal to: ? = (1+2+3+6)/4=3.

      The differences from the mean are: -2, -1, 0, +3. The variance (sigma squared) is the measurement of the squared deviations. The variance is calculated as: ?˛ = {(-2)2 + (-1)2 + 0 + 32}/4=14/4=3.5. Finally, the standard deviation (sigma) is equal to the positive square root of the variance: ? = SQR(3.5)=1.87.


      Nevertheless, there are formulae (plural, indeed) to calculate the statistical deviation in advance. There is a dominant deviation parameter in all the stochastic (probabilistic) events. In fact, all events are stochastic, since randomness is present in everything-there-is. Nothing-there-is can exist with absolute certainty (see the mathematics of the absurdity of absolute certainty: www.saliu.com/formula.htm). The elements of a stochastic phenomenon deviate from one another following mathematical rules. The difference is in the probability of the event (phenomenon). The probability then determines subsequent parameters, such as volatility, standard deviation, FFG deviation, etc.


      In 2003 I announced that I had discovered a formula for a very important measure in the fluctuation of probability events: FFG deviation. See “New pairing research”. Soon thereafter I have been bombarded with requests to present the formula for FFG deviation and the statistical standard deviation. Of course, I was asked (in strong terms sometimes) to release also free software to accompany the formulae calculations. The requests were also presented in public forums, sometimes strongly worded.


      At this time, I do not publish the formulae to calculate the FFG deviation and the statistical standard deviation. Such an act would serve people I do not want to serve. They belong to the following categories: gambling developers and high rollers; lottery systems and software developers; stock traders. I know exactly whom I am talking about. I have received many a message from them. They inundated my hou' with correspondence, including postal mail. They would be the ones that would charge serious money out of my effort. The vast majority of people do not really need to know exactly all the formulas involved in standard deviation calculations. Suffice to say that my software does incorporate standard deviation calculations. Also, the greatest random number/combination generator — IonSaliuGenerator — makes extraordinarily good usage of the standard deviation.



      -Waswas
Re: standard deviation accuracy question
by thor (Priest) on Jul 22, 2005 at 13:52 UTC
    Your function is doing it right. From reading the documentation on Excel's STDEV function, it says that it "estimates the standard deviation based on a sample"...whatever the hell that means. In perusing the functions a little further, there's a function called STDEVP whose description is "calculates the standard deviation based on the entire population given as arguments". Seems like that should be the standard one (no pun intended), but that's the way it goes.

    thor

    Feel the white light, the light within
    Be your own disciple, fan the sparks of will
    For all of us waiting, your kingdom will come

      The sample standard deviation divides by N-1. The population standard deviation divides by N. The N - 1 version is used more often, because in real life you are working with samples. That's why it is the default in Excel.
        Maybe I'm being dense, but if you you know all of the members in a set, why would you want to use a number that's non-representative of that set? If you wouldn't mind, a concrete example might be of assistance. I really want to understand this...

        thor

        Feel the white light, the light within
        Be your own disciple, fan the sparks of will
        For all of us waiting, your kingdom will come

Re: standard deviation accuracy question
by dave_the_m (Monsignor) on Jul 22, 2005 at 13:33 UTC
    When you say you get different results, do you mean slightly different results, or wildly different results?

    Or to put it another way, is this thread to be a discussion of the details of accumulated rounding errors and limits of precision, or a discussion of what is the correct algorithm to calculate std deviations?

    Can you supply a short list of sample data and the output you get with your script and with excel?

    Dave.

Re: standard deviation accuracy question
by jpeg (Chaplain) on Jul 22, 2005 at 14:06 UTC
    How are you calcing standard deviation in Excel? There's STDEV, STDEVA, STDEVP, STDEVPA, off the top of my head.

    Excel's STDEVP uses the n-1 as the denominator, so you should probably use

    my $elements = $#ar;
    just to compare apples and apples.
    --
    jpg
Re: standard deviation accuracy question
by tye (Sage) on Jul 22, 2005 at 13:46 UTC

    It would probably help a lot if you didn't refuse to tell us what the different values are.

    Try using my $elements = @$arr - 1;. I believe I've seen "N-1" optionally used in the demoninator. I also suspect that $_*$_ will be more accurate than $_**2, but I doubt that would matter enough to cause you to write a node.

    - tye        

Re: standard deviation accuracy question
by davidrw (Prior) on Jul 22, 2005 at 14:36 UTC
    Instead of writing your own, can you fix or work around the ppm problem? Perhaps packaging with PAR? Or can you at least just copy the code from Math::StdDev and include it in your source?
Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 15:02 UTC
    the values i got
    condition 1 49.56 condition 2 43.31 condition 3 50.85 condition 4 55.09
    the values excel gave
    condition 1 50.73 condition 2 44.33 condition 3 52.05 condition 4 56.39
    the raw data
    cond1 cond2 cond3 cond4 397 360.5 372 324 322 318.5 354 305.5 350 356 358 364 346 316.5 311 320 338 330 336.5 327.5 324 296.5 297 288.5 461 444 485 432.5 477 474 461 434 432 416.5 407.5 447 373 355 365 369 386 320 342 315 412 401.5 428 397.5 475 363.5 494 515 360.5 332 343.5 339 384 363 372 405 324 324 324 330 424 388 379.5 396 319 312 345 319 454.5 344 389 345 412 372 384 381.5 387 347.5 363 340 370 338 346 317

      I agree with Excel, for the most part:

      condition1 50.73 condition2 44.34 condition3 52.05 condition4 56.39

      the lowliest monk

Re: standard deviation accuracy question
by tlm (Prior) on Jul 22, 2005 at 16:07 UTC

    The code you posted does not compute the sample standard deviation, which is what Excel is computing. You need to make the N – 1 correction. (I.e. before taking the square root, multiply everything by N/(N–1)).

    Or you can use the code I posted in my other reply in this thread (though yours, with the correction, is probably more efficient).

    the lowliest monk

Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 14:54 UTC
    ok i am trying to compute the sample standard deviation and i am comparing to excels STDEV function (which divides by N - 1 giving the samples standard deviation) my values aren't masssivly different from those given by excel but enough to be worrying i believe the algorithm is right i'm just supprised if a precision difference could cauze the differences i'm seeing i'll post the data in a sec probably getting round the broken ppm is the answer there are so many other things i need to install modules for anyway but i got used to writing my own bits for portability around this crazy network
      the formula you're using in the original post divides by N. Of course you'll be getting different results.
      --
      jpg
Re: standard deviation accuracy question
by radiantmatrix (Parson) on Jul 22, 2005 at 17:40 UTC

    As far as non-functional PPMs go, I would suggest finding a dev machine with ActiveState Perl and a functioning PPM, then installing Statistics::Basic::StdDev and its dependencies. You can then copy the C:\Perl\site\lib\ file tree to your target machine. All the modules in question are pure Perl, so you shouldn't have any issues doing this.

    Once you have that module in place, you'll be in a better position to determine the significance of differences between Excel and Perl's output.

    <-radiant.matrix->
    Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
    The Code that can be seen is not the true Code
    "In any sufficiently large group of people, most are idiots" - Kaa's Law
Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 15:18 UTC
    does anyone know why i might be getting different results to these commercially available applications?
    Are you suffering from numerical round-off problems? How many elements are in the array? How big are the individual elements compared to $sum and $sumsq? One way to test for numerical errors is to plot a graph of the standard deviation for various numbers of $elements. If the graph looks like it converges to a value, you're OK, but if the answer plot starts to "get fuzzy" (i.e. more random) you probably have problems.
Re: standard deviation accuracy question
by spiritway (Vicar) on Jul 22, 2005 at 20:29 UTC

    I had a similar problem and it drove me nuts... Your code looks OK. You didn't specify what the discrepancies were, but one thing to look into is whether your standard deviation is for a sample, or for the population. There is a difference. For a sample, you divide by (n-1); for a population, you divide by n (where n = the number of elements).

Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 15:10 UTC
    sorry
    condition1 397 322 350 346 338 324 461 477 432 373 386 412 475 360.5 384 324 424 319 454.5 412 387 370 condition2 360.5 318.5 356 316.5 330 296.5 444 474 416.5 355 320 401.5 363.5 332 363 324 388 312 344 372 347.5 338 condition3 372 354 358 311 336.5 297 485 461 407.5 365 342 428 494 343.5 372 324 379.5 345 389 384 363 346 condition4 324 305.5 364 320 327.5 288.5 432.5 434 447 369 315 397.5 515 339 405 330 396 319 345 381.5 340 317