PerlMonks  

Re: standard deviation accuracy question

by thor (Priest)
on Jul 22, 2005 at 13:52 UTC


in reply to standard deviation accuracy question

Your function is doing it right. The documentation for Excel's STDEV function says that it "estimates the standard deviation based on a sample"...whatever the hell that means. In perusing the functions a little further, there's a function called STDEVP whose description is "calculates the standard deviation based on the entire population given as arguments". Seems like that should be the standard one (no pun intended), but that's the way it goes.

thor

Feel the white light, the light within
Be your own disciple, fan the sparks of will
For all of us waiting, your kingdom will come

Re^2: standard deviation accuracy question
by itub (Priest) on Jul 22, 2005 at 14:28 UTC
    The sample standard deviation divides by N-1. The population standard deviation divides by N. The N - 1 version is used more often, because in real life you are working with samples. That's why it is the default in Excel.
      Maybe I'm being dense, but if you know all of the members in a set, why would you want to use a number that's non-representative of that set? If you wouldn't mind, a concrete example might be of assistance. I really want to understand this...

      thor


        It all depends on the nature of the data.

        If you're together with 20 friends of yours, and you suddenly ask yourself: "what is our average age?", you can easily calculate it. Then you can ask: "Ok, now what is the standard deviation of our ages?", because you would like to know how far from that average you are as a group (I mean, you could average 30 because half are 20 and half are 40, or you could average 30 because all of you are 30). In this case, the population is quite restricted (20 friends), and you're able to carry out your calculations exactly, so you can calculate the population standard deviation. The formula for this calculation is the following (assuming a population of N friends):

        population_stdev = sqrt( (x1**2 + x2**2 + ... + xN**2) / N )

        xi = age_i - average_age

        where age_i is the age of your i-th friend in the group, of course.
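
To make the "half are 20 and half are 40" versus "all of you are 30" comparison concrete, here is a minimal sketch in Perl. The function name `population_stdev` and the two age lists are made up for illustration; they are not taken from the original poster's code.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Population standard deviation: divide the sum of squared
# deviations from the mean by N (the size of the whole group).
sub population_stdev {
    my @values = @_;
    my $n = scalar(@values);
    die "need at least one value\n" unless $n;
    my $mean   = sum(@values) / $n;
    my $sum_sq = sum(map { ($_ - $mean) ** 2 } @values);
    return sqrt($sum_sq / $n);
}

# Hypothetical ages: both groups of 20 friends average 30.
my @split   = ((20) x 10, (40) x 10);   # half are 20, half are 40
my @uniform = (30) x 20;                # everybody is 30

printf "split group:   stdev %.1f\n", population_stdev(@split);
printf "uniform group: stdev %.1f\n", population_stdev(@uniform);
```

Both groups have the same average, but the first one should report a standard deviation of 10.0 and the second 0.0, which is exactly the difference the standard deviation is meant to capture.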

        Now consider the task of calculating the standard deviation over all the people in the USA, for example. Doing it all on your own: no databases, no Internet, nothing. How can you approach such a problem? You have to guess: you restrict your investigation to a quite small portion of the entire population, do your calculations and hope that what you observed in this restricted view scales well to the entire USA population. The restricted group you're considering is called a sample; sometimes, you get nothing better than a restricted sample out of the entire population.

        In this case, the problem undergoes a clear shift in perspective. What was exact mathematics in the first case, dealing with exactly 20 friends, now becomes a matter of estimation. So, instead of using formulas to calculate stuff exactly, you have to build up formulas that estimate stuff as well as possible. In particular, the problem here is building up a formula that does not contain a-priori errors, AKA biases.

        For the average, i.e. the mean value, it can be demonstrated that the formula is pretty much the same as in the exact case. That is, if you use the same formula on the sample, it can be demonstrated that the result is not biased. This does not mean that it will be equal to the true value you would find if you calculated it over the entire population; it only means that you're not introducing systematic errors, i.e. errors that do not depend on the randomness of the data.

        For the standard deviation, the issue is a little trickier. The standard deviation is the square root of another thing called the variance (the stuff inside the sqrt() function); it can be proved that, when computed on a sample, the formula above inside the sqrt() function introduces a multiplicative bias: on average it yields (N - 1) / N times the true variance. How do you correct this bias? Simple: you multiply by N and divide by (N - 1). That's why the usual formula for the sample standard deviation, built on the unbiased estimate of the variance, is the following:

        sample_stdev = sqrt( (x1**2 + x2**2 + ... + xN**2) / (N - 1) )

        xi = age_i - average_age

        Here, N is the size of the sample, of course.
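
The (N - 1) / N bias can be checked by brute force on a toy case. The sketch below is only an illustration: the population (1, 2, 3, 4), the sample size of 2, and all the function names are assumptions chosen so that every possible sample (drawn with replacement) can be enumerated and the estimates averaged exactly, with no random numbers involved.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Sum of squared deviations of the values from their own mean.
sub sum_sq_dev {
    my @values = @_;
    my $mean = sum(@values) / scalar(@values);
    return sum(map { ($_ - $mean) ** 2 } @values);
}

# Variance with the N divisor (exact when given a whole population).
sub variance_n {
    my @values = @_;
    return sum_sq_dev(@values) / scalar(@values);
}

# Variance with the N - 1 divisor (the sample estimate).
sub variance_nm1 {
    my @values = @_;
    return sum_sq_dev(@values) / (scalar(@values) - 1);
}

# Sample standard deviation, as in the formula above.
sub sample_stdev { return sqrt(variance_nm1(@_)) }

# Hypothetical toy population, small enough to enumerate every sample.
my @population    = (1, 2, 3, 4);
my $true_variance = variance_n(@population);

# Every ordered sample of size 2, drawn with replacement (16 in total).
my (@with_n, @with_nm1);
for my $first (@population) {
    for my $second (@population) {
        push @with_n,   variance_n($first, $second);
        push @with_nm1, variance_nm1($first, $second);
    }
}

printf "true variance:            %.3f\n", $true_variance;
printf "average of N estimates:   %.3f\n", sum(@with_n) / scalar(@with_n);
printf "average of N-1 estimates: %.3f\n", sum(@with_nm1) / scalar(@with_nm1);
```

With N = 2 the N-divided estimates should average 0.625, which is (N - 1) / N = 1/2 of the true variance 1.250, while the (N - 1)-divided estimates should average 1.250. Note that this check is about the variance; it says nothing about individual samples, which can land well above or below the true value.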

        Just to conclude, keep in mind that sometimes you cannot even think of getting values for the entire population. For example, if your input data are measurements of the voltage of a battery, you could potentially take infinitely many, slightly different measurements, with no hope of collecting them all in your lifetime :)

        Flavio
        perl -ple'$_=reverse' <<<ti.xittelop@oivalf

        Don't fool yourself.
        The sample is the data set I have. It is drawn from a potentially infinite data set known as the population. As we cannot collect an infinite (or, in practice, infeasibly large) amount of data, we are almost never working with the population but with a sample drawn from it.

        The standard deviation given by Excel's STDEVP function assumes that the data set is an entire population, and is used in those few instances when we are in fact working with the population, which is why it isn't the default. As the sample size N increases, the two functions should converge (as the ratio N / (N - 1) approaches 1).

        The reason that I need the SAMPLE std dev is that I am using it to compute the standard error, which is the standard deviation of all the differences between the means of all the different possible samples I could draw from the population and the population mean itself. It gives a measure of how reliable any given sample mean is as an approximation to the population mean, i.e. it is a measure of the amount of noise in your data. It is (I believe) a consequence of the central limit theorem that the standard error can be computed from the standard deviation of the sample as stddev / sqrt(sample_size).
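
As a rough sketch of that last formula, the Perl below computes the standard error as the sample standard deviation (N - 1 divisor) divided by sqrt(N), and also prints how quickly the N versus N - 1 choice stops mattering. The name `standard_error` and the measurement values are hypothetical, picked only for illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Standard error of the mean: sample stdev (N - 1 divisor) / sqrt(N).
sub standard_error {
    my @values = @_;
    my $n = scalar(@values);
    die "need at least two values\n" unless $n > 1;
    my $mean   = sum(@values) / $n;
    my $sum_sq = sum(map { ($_ - $mean) ** 2 } @values);
    my $stdev  = sqrt($sum_sq / ($n - 1));
    return $stdev / sqrt($n);
}

# Hypothetical measurements.
my @sample = (2, 4, 4, 4, 5, 5, 7, 9);
printf "standard error: %.4f\n", standard_error(@sample);

# Ratio between the sample and population standard deviations for
# the same data: sqrt(N / (N - 1)), which approaches 1 as N grows.
for my $n (2, 10, 100, 1000) {
    printf "N = %4d  ratio = %.4f\n", $n, sqrt($n / ($n - 1));
}
```

For small N the two divisors give noticeably different answers; by N = 1000 the ratio is within a tenth of a percent of 1.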
