in reply to Seeking abnormalities in data sets.

At the end of Chapter 15 (Statistics) of Orwant's book, Mastering Algorithms with Perl, he talks about fitting a "best-fit" straight line (linear least squares, regression line) to a set of data points -- the standard y = bx + a kind of thing from high-school algebra.

Once you have that line for a given range of points, I'd think it would be a small matter to find the rogues: measure the distance from the line to each point in the set and see whether any single point is really whacked out. (Related terms: correlation coefficient, r-to-t transformation.)
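To give the idea some shape, here is a rough sketch of my own. It is not the code from the book, and the names fit_line and find_outliers are made up for illustration. It assumes "distance" means the vertical distance (the residual) rather than the perpendicular one, and it assumes a point counts as a rogue when its residual is more than $k residual standard deviations from the line, with $k defaulting to 2. Both choices are arbitrary and worth tuning for your data.

```perl
use strict;
use warnings;

# Least-squares fit of y = slope * x + intercept.
# Takes two array refs of equal length; returns (slope, intercept).
sub fit_line {
    my ($x, $y) = @_;
    my $n = @$x;
    die "need at least 3 points, same count of x and y\n"
        if $n < 3 || $n != @$y;

    my ($mean_x, $mean_y) = (0, 0);
    $mean_x += $_ for @$x;
    $mean_y += $_ for @$y;
    $mean_x /= $n;
    $mean_y /= $n;

    my ($sxx, $sxy) = (0, 0);
    for my $i (0 .. $n - 1) {
        $sxx += ($x->[$i] - $mean_x) ** 2;
        $sxy += ($x->[$i] - $mean_x) * ($y->[$i] - $mean_y);
    }
    die "all x values are identical\n" if $sxx == 0;

    my $slope = $sxy / $sxx;
    return ($slope, $mean_y - $slope * $mean_x);
}

# Returns the indices of points whose vertical distance from the
# fitted line exceeds $k residual standard deviations.
# The default of 2 is an assumption, not a rule.
sub find_outliers {
    my ($x, $y, $k) = @_;
    $k = 2 unless defined $k;

    my ($slope, $intercept) = fit_line($x, $y);
    my @resid = map { $y->[$_] - ($slope * $x->[$_] + $intercept) } 0 .. $#$x;

    my $sum_sq = 0;
    $sum_sq += $_ ** 2 for @resid;
    my $sd = sqrt($sum_sq / (@resid - 2));   # n - 2: two fitted parameters
    return () if $sd == 0;                   # perfect fit, nothing to flag

    return grep { abs($resid[$_]) > $k * $sd } 0 .. $#resid;
}

# Made-up data: y = 2x + 1 with one rogue planted at x = 6.
my @x = (1 .. 10);
my @y = map { 2 * $_ + 1 } @x;
$y[5] = 40;

my @rogues = find_outliers(\@x, \@y);
print "suspect point: x=$x[$_] y=$y[$_]\n" for @rogues;
```

On that toy data only the planted point at x = 6 gets flagged. Keep in mind that a big rogue drags the fitted line toward itself and inflates the standard deviation, so on small sets you may want to refit after dropping the suspect and compare.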

Since I'm not willing to re-type the subroutines here, use the parenthetical terms above in a search engine to find a good algorithm you can transcribe to Perl. Or the MAP examples may be online somewhere at ORA (as they are for the Cookbook and other ORA publications).

update: Fixed attribution.
