Re^4: standard deviation accuracy question

It all depends on the nature of the data.

If you're together with 20 friends of yours, and you suddenly ask yourself "what is our average age?", you can easily calculate it. Then you can ask: "OK, now what is the standard deviation of our ages?", because you would like to know how far from that average you are as a group (I mean, you could average 30 because half of you are 20 and half are 40, or you could average 30 because all of you are 30). In this case the population is quite small (20 friends) and you can carry out the calculation exactly, so you can calculate the population standard deviation. The formula for this calculation is the following (assuming a population of N friends):

                           x1**2 + x2**2 + ... + xN**2
   population_stdev = sqrt(---------------------------)
                                       N

   xi = age_i - average_age

where age_i is the age of your i-th friend in the group, of course.
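
As a rough sketch, the formula above translates to Perl almost literally. The sub name population_stdev and the two groups of ages are made up for illustration; they mirror the "half are 20, half are 40" and "all are 30" groups of friends mentioned earlier.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Population standard deviation: the sum of squared deviations
# from the average is divided by N, the size of the whole population.
sub population_stdev {
    my @ages   = @_;
    my $n      = @ages;
    my $mean   = sum(@ages) / $n;
    my $sum_sq = sum(map { ($_ - $mean) ** 2 } @ages);
    return sqrt($sum_sq / $n);
}

# Two hypothetical groups of 20 friends, both averaging 30.
my @mixed = ((20) x 10, (40) x 10);
my @same  = (30) x 20;

printf "mixed: %.1f\n", population_stdev(@mixed);   # mixed: 10.0
printf "same:  %.1f\n", population_stdev(@same);    # same:  0.0
```

Both groups have the same average, but only the standard deviation tells them apart.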

Now consider the task of calculating the standard deviation of the ages of all the people in the USA, for example. Doing it all on your own: no databases, no Internet, nothing. How can you approach such a problem? You have to guess: you restrict your investigation to a fairly small portion of the entire population, do your calculations, and hope that what you observe in this restricted view carries over well to the entire USA population. The restricted group you're considering is called a sample; sometimes a sample is the best you can get out of the entire population.

In this case, the problem undergoes a clear shift in perspective. What was exact mathematics in the first case, dealing with exactly 20 friends, now becomes a matter of estimation. So, instead of using formulas to calculate stuff exactly, you have to build formulas that estimate stuff as well as possible. In particular, the problem here is building a formula that does not contain systematic errors, a.k.a. biases.

For the average, i.e. the mean value, it can be demonstrated that the formula is pretty much the same as in the exact case. That is, if you apply the usual formula to the sample, it can be demonstrated that the result is not biased. This does not mean that it will be equal to the true value you would find if you calculated it over the entire population; it only means that you're not introducing systematic errors, i.e. errors that do not depend on the randomness of the data.
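
To make "not biased" a bit more concrete, here is a small sketch under made-up assumptions: a toy population of three ages (20, 30, 40) and every possible sample of two ages drawn from it with replacement. None of these numbers come from real data; they are only there so the whole thing can be enumerated exactly.

```perl
use strict;
use warnings;
use List::Util qw(sum min max);

# Hypothetical toy population; its true mean is 30.
my @population = (20, 30, 40);
my $true_mean  = sum(@population) / @population;

# Mean of every possible sample of size 2, drawn with replacement.
my @sample_means;
for my $first (@population) {
    for my $second (@population) {
        push @sample_means, ($first + $second) / 2;
    }
}

printf "true mean:               %.2f\n", $true_mean;
printf "sample means range from  %.2f to %.2f\n",
    min(@sample_means), max(@sample_means);
printf "average of sample means: %.2f\n",
    sum(@sample_means) / @sample_means;
# true mean:               30.00
# sample means range from  20.00 to 40.00
# average of sample means: 30.00
```

A single sample can be off by quite a lot, but averaged over all the possible samples the error cancels out: that is what the lack of bias buys you.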

For the standard deviation, the issue is a little trickier. The standard deviation is the square root of another quantity called the variance (the stuff inside the sqrt() function). When you apply the expression inside sqrt() above to a sample, taking the deviations from the sample's own average, it can be proved that on average it underestimates the true variance by a multiplicative factor equal to (N - 1) / N (assuming the sample items are drawn independently). How do you correct this bias? Simple: you multiply by N and divide by (N - 1). This gives an unbiased estimate of the variance, and the formula usually adopted for the sample standard deviation is its square root:

                           x1**2 + x2**2 + ... + xN**2
       sample_stdev = sqrt(---------------------------)
                                     N - 1

   xi = age_i - average_age

Here, N is the size of the sample, and average_age is the average computed over the sample itself, of course.
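
The (N - 1) / N factor can be checked by brute force on a tiny case. The sketch below assumes a made-up population of three ages (20, 30, 40) and enumerates every sample of size N = 2 drawn with replacement; the sub names sum_sq_dev and sample_stdev are just illustrative choices.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sum of squared deviations of the values from their own average.
sub sum_sq_dev {
    my @x    = @_;
    my $mean = sum(@x) / @x;
    return sum(map { ($_ - $mean) ** 2 } @x);
}

# Sample standard deviation: divide by N - 1 instead of N.
sub sample_stdev {
    my @x = @_;
    die "need at least two values\n" if @x < 2;
    return sqrt(sum_sq_dev(@x) / (@x - 1));
}

# Hypothetical toy population; its true variance is 200/3.
my @population = (20, 30, 40);
my $true_var   = sum_sq_dev(@population) / @population;

# Variance estimates over every sample of size N = 2 (with replacement).
my $n = 2;
my (@biased, @unbiased);
for my $first (@population) {
    for my $second (@population) {
        my $ss = sum_sq_dev($first, $second);
        push @biased,   $ss / $n;          # divide by N
        push @unbiased, $ss / ($n - 1);    # divide by N - 1
    }
}

printf "true variance:        %.2f\n", $true_var;
printf "average of /N:        %.2f\n", sum(@biased) / @biased;
printf "average of /(N - 1):  %.2f\n", sum(@unbiased) / @unbiased;
printf "sample_stdev(20, 40): %.2f\n", sample_stdev(20, 40);
# true variance:        66.67
# average of /N:        33.33
# average of /(N - 1):  66.67
# sample_stdev(20, 40): 14.14
```

With N = 2 the factor (N - 1) / N is 1/2, and dividing by N indeed gives half of the true variance on average, while dividing by N - 1 hits it exactly.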

Just to conclude, keep in mind that sometimes you cannot even think of getting values for the entire population. For example, if your input data are measurements of the voltage of a battery, you could potentially take infinitely many, slightly different measurements, with no hope of collecting them all in your lifetime :)

Flavio
perl -ple'$_=reverse' <<<ti.xittelop@oivalf

Don't fool yourself.


Replies are listed 'Best First'.
Re^5: standard deviation accuracy question by thor (Priest) on Jul 23, 2005 at 23:19 UTC
Thanks much! There's a little hand waving going on here, but the explanation is good enough for me.

thor
`Feel the white light, the light within Be your own disciple, fan the sparks of will For all of us waiting, your kingdom will come`
Re^6: standard deviation accuracy question by tlm (Prior) on Jul 24, 2005 at 00:45 UTC
The best intuitive explanation I have come across for why dividing by N the sum of squared deviations from the sample mean underestimates the population variance is that the sample mean "follows" the sample; i.e. the sample almost always deviates from its own mean less than it deviates from the population mean (and it never deviates more). This is the source of the bias frodo72 alluded to.

This intuitive argument only shows that simply taking the sample average of squared deviations from the sample mean will underestimate the population variance, but it does not at all prove that N/(N − 1) is the right correction factor. I don't know of an intuitive argument for this, but a nice rigorous derivation can be found here.

the lowliest monk
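
The "sample mean follows the sample" point above can be seen on a single example. This is only a sketch: the population mean of 30 is assumed to be known for the sake of comparison, and the three sample values and the sub name sum_sq_about are made up.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sum of squared deviations of the values from an arbitrary center.
sub sum_sq_about {
    my ($center, @x) = @_;
    return sum(map { ($_ - $center) ** 2 } @x);
}

my $population_mean = 30;             # assumed known, for comparison only
my @sample          = (20, 20, 30);   # hypothetical sample
my $sample_mean     = sum(@sample) / @sample;

my $about_sample = sum_sq_about($sample_mean,     @sample);
my $about_pop    = sum_sq_about($population_mean, @sample);

printf "about sample mean:     %.2f\n", $about_sample;
printf "about population mean: %.2f\n", $about_pop;
# about sample mean:     66.67
# about population mean: 200.00

# The gap is exactly N * (sample_mean - population_mean)**2,
# which can never be negative.
printf "gap: %.2f\n", @sample * ($sample_mean - $population_mean) ** 2;
# gap: 133.33
```

Since the gap is never negative, the sum of squares about the sample mean can only be smaller than (or equal to) the one about the population mean, which matches the "never deviates more" remark.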
Re^6: standard deviation accuracy question by polettix (Vicar) on Jul 24, 2005 at 13:45 UTC
This is no explanation, but it may help. When you concentrate on a sample instead of the entire population, you're doing two estimates: the mean and the variance (the square of the standard deviation). The issue is that when you estimate the variance, you're subtracting the estimated mean from each item in the sample: you're using an estimate inside another estimate.

Which leads us to the concept of degrees of freedom. The sample has N degrees of freedom, i.e. N ways in which it can vary: you can have a different value for each of the N items. Thus, when you estimate the mean value, you divide by N.

When you estimate the variance, you're using the mean value evaluated over the sample, as said. Given the fact that you're implicitly trusting that mean value to be correct (otherwise you'd not use it to evaluate the variance!), you're stealing a degree of freedom. I mean: if you fix the value of the mean, you can move only (N-1) items, and the N-th will be bound to have the value that leads to the given mean. Thus, a variance evaluated in this way only takes into account the variations brought by (N-1) items, not N.

Hope that this intuitively helps :)

Flavio
perl -ple'$_=reverse' <<<ti.xittelop@oivalf

Don't fool yourself.
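
The "stolen degree of freedom" can be made visible with a few lines of Perl. The five sample values below are invented for illustration; the point is only that, once the sample mean is fixed, the last deviation can be recovered from the other N - 1.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical sample of five ages; its mean is 30.
my @sample = (22, 25, 31, 34, 38);
my $mean   = sum(@sample) / @sample;

# Deviations from the sample's own mean always add up to zero...
my @dev = map { $_ - $mean } @sample;
print "deviations: @dev\n";                 # deviations: -8 -5 1 4 8

# ...so the last one is fully determined by the first N - 1.
my $recovered = 0 - sum(@dev[0 .. $#dev - 1]);
printf "last deviation: %d, recovered from the others: %d\n",
    $dev[-1], $recovered;
# last deviation: 8, recovered from the others: 8
```

Only N - 1 of the deviations are free to move, which is the informal reason behind dividing by N - 1.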

