in reply to Sorting Votes, Confidence & Deviations

I think you will have the most luck redefining your question. I'll propose a different measure and then suggest a way to find it here. Basically, you think you are looking for the highest-rated item, but here's another way of phrasing this that is more useful: you want to display the item that you believe has the highest percentage chance of being one a user will like. Why not use statistics' standard "margin of error" or "confidence" formulas to equalize this out? Here's my proposed algorithm (a Perl sketch follows the list):

1. Assign everything a percentage score based on pure division: 5/5 with 1000 votes is 100%, 5/5 with 2 votes is 100%, 3/5 with 1000 votes is 60%, etc.

2. Based on the sample size (the number of voters) as opposed to your total user base (the number of registered users), find the margin of error using the standard formula found in this image: http://upload.wikimedia.org/math/7/3/a/73a9acf851e5c9fe3cdfcf52a1612cd0.png

3. Penalize every score by decrementing it by the largest assumable negative margin of error (better safe than sorry).

This will all work out very nicely due to the laws of statistics. The 5/5 with 1000 votes will fall to something like 98%, whereas the 5/5 with only 2 votes will have an absurd margin of error (since 2 votes tell you almost nothing) and fall to something like 40%, thereby losing out to the only slightly decremented 3/5 with 1000 votes, which might fall to something like 58%. What does anybody else think of this algorithm?
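Here is a minimal Perl sketch of the three steps. The exact margin-of-error formula is my assumption (I'm reading the linked image as the maximum margin of error at 95% confidence, 0.98/sqrt(n), with a finite population correction for n voters out of N registered users); the function and its arguments are made up for illustration.

    use strict;
    use warnings;

    sub penalized_score {
        my ($avg_vote, $max_vote, $n_votes, $n_users) = @_;
        return 0 unless $n_votes > 0;

        my $pct = $avg_vote / $max_vote;              # step 1: raw percentage

        # step 2 (assumed formula): worst-case 95% margin of error,
        # 0.98/sqrt(n), scaled by the finite population correction
        # for n voters sampled out of N registered users
        my $fpc = $n_users > 1
            ? sqrt( ($n_users - $n_votes) / ($n_users - 1) )
            : 1;
        my $moe = 0.98 / sqrt($n_votes) * $fpc;

        my $score = $pct - $moe;                      # step 3: penalize
        return $score < 0 ? 0 : $score;
    }

    # 5/5 from only 2 votes now loses to 3/5 from 1000 votes:
    printf "5/5, 2 votes:    %.0f%%\n", 100 * penalized_score(5, 5, 2,    50_000);
    printf "3/5, 1000 votes: %.0f%%\n", 100 * penalized_score(3, 5, 1000, 50_000);

With these numbers the 5/5-from-2-votes item drops to roughly 31%, while the 3/5-from-1000-votes item only drops to about 57%, in line with the ballpark figures above.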

Replies are listed 'Best First'.
Re^2: Sorting Votes, Confidence & Deviations
by billisdog (Sexton) on Apr 12, 2007 at 21:08 UTC
    Just adding this on: the reason I like the above solution so much is that it penalizes low-confidence entries, but not with any kind of magic number. It does it with perhaps the most time-tested and mathematically backed-up method of penalizing low-confidence entries ever!
Re^2: Sorting Votes, Confidence & Deviations
by Anonymous Monk on Apr 13, 2007 at 14:40 UTC
    While I agree with the use of Margin of Error/Confidence values, I think your approach is overly pessimistic. Take for example an entry that gets a single minimum vote (1 or 0) out of 5. This is just as likely to be deceptive, and your approach would, incorrectly I believe, drive this value even farther below the minimum. As this problem is most noticeable only at or near 100% and 0%, I suggest another approach (sketched below): generate two values, Score + Error and Score - Error, trim these values to the range in question, and average the trimmed values for a final score. This would have the effect of moving scores that are pegged at the end of the range inward by half their *possible* error. Scores near the center of the range will be left largely unaffected.
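    A minimal Perl sketch of that trimming idea, assuming scores and errors are already normalized to a 0..1 scale (the names are mine):

        sub clamp { my $x = shift; $x < 0 ? 0 : $x > 1 ? 1 : $x }

        sub trimmed_score {
            my ($score, $error) = @_;          # both on a 0..1 scale
            my $hi = clamp($score + $error);   # upper bound, trimmed to the range
            my $lo = clamp($score - $error);   # lower bound, trimmed to the range
            return ($hi + $lo) / 2;            # average of the trimmed bounds
        }

        # A score pegged at 1.0 with error 0.4 moves inward by half its
        # possible error: (1.0 + 0.6) / 2 = 0.8. A mid-range score is
        # untouched: trimmed_score(0.5, 0.4) is (0.9 + 0.1) / 2 = 0.5.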
      I agree that the approach is probably too pessimistic. It essentially takes the view that all uncertainty is downside.

      I would be reluctant to fudge the calculation of the mean. Which aspect of the data carries more "information": the mean or the confidence? That should guide how you combine the two.

      To my eye the least invasive treatment would be to compute the means and the standard errors, and then use the standard error to break ties when sorting by the means. Obviously you need the actual vote values for this (to compute the standard deviation). For items where you don't have them, maybe you can stick in a "default" standard error (you don't know what you're scoring, so who knows). A sketch follows.
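      A short Perl sketch of that sort, with made-up data; each item carries a precomputed mean and standard error (standard deviation divided by the square root of the vote count):

          my @items = (
              { name => 'A', mean => 0.92, std_err => 0.01 },
              { name => 'B', mean => 0.92, std_err => 0.20 },  # same mean, less certain
              { name => 'C', mean => 0.60, std_err => 0.02 },
          );

          my @sorted = sort {
              $b->{mean}    <=> $a->{mean}      # highest mean first...
                  ||
              $a->{std_err} <=> $b->{std_err}   # ...lowest standard error breaks ties
          } @items;

          print join(', ', map { $_->{name} } @sorted), "\n";  # prints: A, B, C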