more useful options PerlMonks

### standard deviation accuracy question

 on Jul 22, 2005 at 13:12 UTC Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

i wrote a function to compute standard deviation because i am in the ridiculous and unenviable position of using a networked version of perl with a non functioning ppm

(else i would have used Math::StdDev)

now it turns out my function doesn't give the same results as Excel or SPSS (its not because one is the sample stddev and the other the population stddev i already checked that)

next i copied a stddev function straight out of a linux journal article on benchmarking

```
sub std_dev_ref_sum {
my \$ar = shift;
my \$elements = scalar @\$ar;
my \$sum = 0;
my \$sumsq = 0;
foreach (@\$ar) {
\$sum += \$_;
\$sumsq += (\$_ **2);
}
return sqrt( \$sumsq/\$elements -
((\$sum/\$elements) ** 2));
}

this gives the same answers as my original code (hence i don't bother posting my original code as well)

does anyone know why i might be getting different results to these commercially available applications?

Replies are listed 'Best First'.
Re: standard deviation accuracy question
by NetWallah (Canon) on Jul 22, 2005 at 13:47 UTC
Just to get this info in PM for future lookups in case that link dies:

1. The binomial standard deviation

The binomial standard deviation applies to events with two outcomes: win or lose. For example, betting on heads in coin tossing can lead to win (the appearance of heads) or loss (the appearance of the opposite; tails, in this case). The binomial standard deviation is calculated by the following formula: Standard deviation = SQR{(N*p*(1-p)}

That is, the square root of: the number of trials (events) N, multiplied by the probability p, multiplied by the opposite probability (or 1 minus p). (where p is the probability of appearance and N represents the number of trials).

Suppose we toss a coin 100 times (N=100). The probability of heads is p=1/2=0.5. The standard deviation is SQR{100 * 0.5 * 0.5} = SQR(100 * .25) =SQR(25) = 5. The expected number of heads in 100 tosses is 0.5 * 100 = 50. The rule of normal probability proves that in 68.2% of the cases, the number of heads will fall within one standard deviation from the number of expected successes (50). That is, if we repeat 1000 times the event of tossing a coin 100 times, in 682 cases we'll encounter a number of heads between 45 and 55.

2. The statistical standard deviation

There is no formula to calculate the statistics standard deviation directly (?) That's what they told you in school. That's what they say in other public places with the self-proclaimed goal of education. Only an algorithm can lead to the standard deviation of a data series. Indeed, the algorithm is always available. The following are the steps of the algorithm implemented in my freeware Super Formula. Sum up data; calculate the mean average (sum total divided by the number of elements); deduct each element of the collection from the average; raise each difference to the power of 2; add up the squared differences; divide the new sum total by the number of elements in the data series; the result represents the variance; the square root of the variance represents the famous standard deviation.

A data series like 1, 2, 3, 6 has a mean average (mu) equal to: ? = (1+2+3+6)/4=3.

The differences from the mean are: -2, -1, 0, +3. The variance (sigma squared) is the measurement of the squared deviations. The variance is calculated as: ?˛ = {(-2)2 + (-1)2 + 0 + 32}/4=14/4=3.5. Finally, the standard deviation (sigma) is equal to the positive square root of the variance: ? = SQR(3.5)=1.87.

Nevertheless, there are formulae (plural, indeed) to calculate the statistical deviation in advance. There is a dominant deviation parameter in all the stochastic (probabilistic) events. In fact, all events are stochastic, since randomness is present in everything-there-is. Nothing-there-is can exist with absolute certainty (see the mathematics of the absurdity of absolute certainty: www.saliu.com/formula.htm). The elements of a stochastic phenomenon deviate from one another following mathematical rules. The difference is in the probability of the event (phenomenon). The probability then determines subsequent parameters, such as volatility, standard deviation, FFG deviation, etc.

In 2003 I announced that I had discovered a formula for a very important measure in the fluctuation of probability events: FFG deviation. See “New pairing research”. Soon thereafter I have been bombarded with requests to present the formula for FFG deviation and the statistical standard deviation. Of course, I was asked (in strong terms sometimes) to release also free software to accompany the formulae calculations. The requests were also presented in public forums, sometimes strongly worded.

At this time, I do not publish the formulae to calculate the FFG deviation and the statistical standard deviation. Such an act would serve people I do not want to serve. They belong to the following categories: gambling developers and high rollers; lottery systems and software developers; stock traders. I know exactly whom I am talking about. I have received many a message from them. They inundated my hou' with correspondence, including postal mail. They would be the ones that would charge serious money out of my effort. The vast majority of people do not really need to know exactly all the formulas involved in standard deviation calculations. Suffice to say that my software does incorporate standard deviation calculations. Also, the greatest random number/combination generator — IonSaliuGenerator — makes extraordinarily good usage of the standard deviation.

-Waswas
Re: standard deviation accuracy question
by thor (Priest) on Jul 22, 2005 at 13:52 UTC
Your function is doing it right. From reading the documentation on Excel's STDEV function, it says that it "estimates the standard deviation based on a sample"...whatever the hell that means. In perusing the functions a little further, there's a function called STDEVP whose description is "calculates the standard deviation based on the entire population given as arguments". Seems like that should be the standard one (no pun intended), but that's the way it goes.

thor

Feel the white light, the light within
Be your own disciple, fan the sparks of will
For all of us waiting, your kingdom will come

The sample standard deviation divides by N-1. The population standard deviation divides by N. The N - 1 version is used more often, because in real life you are working with samples. That's why it is the default in Excel.
Maybe I'm being dense, but if you you know all of the members in a set, why would you want to use a number that's non-representative of that set? If you wouldn't mind, a concrete example might be of assistance. I really want to understand this...

thor

Feel the white light, the light within
Be your own disciple, fan the sparks of will
For all of us waiting, your kingdom will come

Re: standard deviation accuracy question
by dave_the_m (Monsignor) on Jul 22, 2005 at 13:33 UTC
When you say you get different results, do you mean slightly different results, or wildly different results?

Or to put it another way, is this thread to be a discussion of the details of accumulated rounding errors and limits of precision, or a discussion of what is the correct algorithm to calculate std deviations?

Can you supply a short list of sample data and the output you get with your script and with excel?

Dave.

Re: standard deviation accuracy question
by jpeg (Chaplain) on Jul 22, 2005 at 14:06 UTC
How are you calcing standard deviation in Excel? There's STDEV, STDEVA, STDEVP, STDEVPA, off the top of my head.

Excel's STDEVP uses the n-1 as the denominator, so you should probably use

```my \$elements = \$#ar;
just to compare apples and apples.
--
jpg
Re: standard deviation accuracy question
by tye (Sage) on Jul 22, 2005 at 13:46 UTC

It would probably help a lot if you didn't refuse to tell us what the different values are.

Try using my \$elements = @\$arr - 1;. I believe I've seen "N-1" optionally used in the demoninator. I also suspect that \$_*\$_ will be more accurate than \$_**2, but I doubt that would matter enough to cause you to write a node.

- tye

Re: standard deviation accuracy question
by davidrw (Prior) on Jul 22, 2005 at 14:36 UTC
Instead of writing your own, can you fix or work around the ppm problem? Perhaps packaging with PAR? Or can you at least just copy the code from Math::StdDev and include it in your source?
Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 15:02 UTC
the values i got
```condition 1 49.56
condition 2 43.31
condition 3 50.85
condition 4 55.09
the values excel gave
```condition 1 50.73
condition 2 44.33
condition 3 52.05
condition 4 56.39
the raw data
```cond1   cond2   cond3   cond4
397     360.5   372     324
322     318.5   354     305.5
350     356     358     364
346     316.5   311     320
338     330     336.5   327.5
324     296.5   297     288.5
461     444     485     432.5
477     474     461     434
432     416.5   407.5   447
373     355     365     369
386     320     342     315
412     401.5   428     397.5
475     363.5   494     515
360.5   332     343.5   339
384     363     372     405
324     324     324     330
424     388     379.5   396
319     312     345     319
454.5   344     389     345
412     372     384     381.5
387     347.5   363     340
370     338     346     317

I agree with Excel, for the most part:

```condition1 50.73
condition2 44.34
condition3 52.05
condition4 56.39

the lowliest monk

Re: standard deviation accuracy question
by tlm (Prior) on Jul 22, 2005 at 16:07 UTC

The code you posted does not compute the sample standard deviation, which is what Excel is computing. You need to make the N – 1 correction. (I.e. before taking the square root, multiply everything by N/(N–1)).

Or you can use the code I posted in my other reply in this thread (though yours, with the correction, is probably more efficient).

the lowliest monk

Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 14:54 UTC
ok i am trying to compute the sample standard deviation and i am comparing to excels STDEV function (which divides by N - 1 giving the samples standard deviation) my values aren't masssivly different from those given by excel but enough to be worrying i believe the algorithm is right i'm just supprised if a precision difference could cauze the differences i'm seeing i'll post the data in a sec probably getting round the broken ppm is the answer there are so many other things i need to install modules for anyway but i got used to writing my own bits for portability around this crazy network
the formula you're using in the original post divides by N. Of course you'll be getting different results.
--
jpg
Re: standard deviation accuracy question
by radiantmatrix (Parson) on Jul 22, 2005 at 17:40 UTC

As far as non-functional PPMs go, I would suggest finding a dev machine with ActiveState Perl and a functioning PPM, then installing Statistics::Basic::StdDev and its dependencies. You can then copy the C:\Perl\site\lib\ file tree to your target machine. All the modules in question are pure Perl, so you shouldn't have any issues doing this.

Once you have that module in place, you'll be in a better position to determine the significance of differences between Excel and Perl's output.

Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
The Code that can be seen is not the true Code
"In any sufficiently large group of people, most are idiots" - Kaa's Law
Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 15:18 UTC
does anyone know why i might be getting different results to these commercially available applications?
Are you suffering from numerical round-off problems? How many elements are in the array? How big are the individual elements compared to \$sum and \$sumsq? One way to test for numerical errors is to plot a graph of the standard deviation for various numbers of \$elements. If the graph looks like it converges to a value, you're OK, but if the answer plot starts to "get fuzzy" (i.e. more random) you probably have problems.
Re: standard deviation accuracy question
by spiritway (Vicar) on Jul 22, 2005 at 20:29 UTC

I had a similar problem and it drove me nuts... Your code looks OK. You didn't specify what the discrepancies were, but one thing to look into is whether your standard deviation is for a sample, or for the population. There is a difference. For a sample, you divide by (n-1); for a population, you divide by n (where n = the number of elements).

Re: standard deviation accuracy question
by Anonymous Monk on Jul 22, 2005 at 15:10 UTC
sorry
```condition1
397
322
350
346
338
324
461
477
432
373
386
412
475
360.5
384
324
424
319
454.5
412
387
370

condition2
360.5
318.5
356
316.5
330
296.5
444
474
416.5
355
320
401.5
363.5
332
363
324
388
312
344
372
347.5
338

condition3
372
354
358
311
336.5
297
485
461
407.5
365
342
428
494
343.5
372
324
379.5
345
389
384
363
346

condition4
324
305.5
364
320
327.5
288.5
432.5
434
447
369
315
397.5
515
339
405
330
396
319
345
381.5
340
317

Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://477217]
Approved by Corion
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others surveying the Monastery: (4)
As of 2023-04-02 06:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
Voting Booth?

No recent polls found

Notices?