Seeking abnormalities in data sets.

ehdonhon has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Seeking abnormalities in data sets. by clintp (Curate) on Dec 27, 2001 at 01:37 UTC
In Orwant's book, Mastering Algorithms with Perl, at the end of Chapter 15 (Statistics), he talks about finding a "best-fit" straight line (linear least squares, regression line) for a set of data points -- a standard y=bx+a kind of thing from HS Algebra. Once you have that line for a given range of points, I'd think it would be a small matter to find the rogues (correlation coefficient, r-to-t transformation): use the distance from the line to each point in the set to see if any single point is really whacked out. Since I'm not willing to re-type the subroutines here, use the parenthetical terms above in a search engine to find a good algorithm you can transcribe to Perl. Or the MAP examples may be online somewhere at ORA (as they are for the Cookbook and other ORA publications). update: Fixed attribution.
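For anyone who wants something to start from, here is a minimal pure-Perl sketch of the idea above: fit the least-squares line, then flag points that sit far from it. This is not the code from the book. The sub names fit_line and find_rogues are made up for illustration, the "distance" is the vertical residual rather than the perpendicular distance, and the cutoff of $k residual standard deviations is an assumption you would tune for your own data.

```perl
use strict;
use warnings;

# Ordinary least-squares fit of y = slope*x + intercept.
# Takes two array refs of equal length; returns (intercept, slope).
sub fit_line {
    my ($xs, $ys) = @_;
    my $n = @$xs;
    my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        $sx  += $xs->[$i];
        $sy  += $ys->[$i];
        $sxx += $xs->[$i] ** 2;
        $sxy += $xs->[$i] * $ys->[$i];
    }
    my $denom = $n * $sxx - $sx ** 2;
    die "need at least two distinct x values\n" if $denom == 0;
    my $slope     = ($n * $sxy - $sx * $sy) / $denom;
    my $intercept = ($sy - $slope * $sx) / $n;
    return ($intercept, $slope);
}

# Return the indices of points whose vertical distance from the fitted
# line exceeds $k residual standard deviations ($k is an assumed cutoff).
sub find_rogues {
    my ($xs, $ys, $k) = @_;
    die "need at least three points\n" if @$xs < 3;
    my ($intercept, $slope) = fit_line($xs, $ys);
    my @resid = map { $ys->[$_] - ($intercept + $slope * $xs->[$_]) } 0 .. $#$xs;
    my $ss = 0;
    $ss += $_ ** 2 for @resid;
    my $sd = sqrt($ss / (@resid - 2));    # two parameters were fitted
    return grep { abs($resid[$_]) > $k * $sd } 0 .. $#resid;
}

# Made-up data: a clean line y = 2x + 1 with one planted bad point.
my @x = (1 .. 10);
my @y = map { 2 * $_ + 1 } @x;
$y[5] += 20;                              # the bad point is at x = 6
my @rogues = find_rogues(\@x, \@y, 2);
print "suspect index: $_ (x=$x[$_], y=$y[$_])\n" for @rogues;
```

Keep in mind that a bad point also drags the fitted line and inflates the residual spread, so with several rogues in one set it may be worth refitting after removing the worst offender.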
Re: Seeking abnormalities in data sets. by termix (Beadle) on Dec 27, 2001 at 01:24 UTC
If I understand your problem correctly, you wish to do curve fitting in Perl on large amounts of data and then detect the exceptions. Yes, there are a number of statistical methods to accomplish that (which I know very little about). The statistical modules for Perl might help. Check them out here. (or try the Math modules here). I know there is a book that talks about specific curve fitting examples. Ah, here it is. I believe if you know the math behind the subject, then you can create your own algorithm with the help of this book (and contribute to CPAN!). Maybe you don't have to do the curve fitting in Perl. You could use a statistics package for that, if there is one that you already use and have code/scripts written for. Perl can be your data parsing and results presentation system and coordinate the work of the curve fitting program. -- termix
Re: Seeking abnormalities in data sets. by scain (Curate) on Dec 27, 2001 at 02:47 UTC
Do you know that your data sets will always be either (a) linear (i.e., y = mx + b) or (b) exponential (y = A exp(Bx))? If that's the case, then you should be able to use linear least squares as suggested above. Since (a) and (b) are separate cases, you would have to try both and decide on a case-by-case basis which is better. Also, in the case of (b), you can convert it to a linear problem by taking the log of y and plotting that against x. (At least that feels right at the moment... log(y) = log(A) + Bx... yeah, that's it). If your data could be of other forms, like higher order polynomials, then you will have to try all options, and it would turn into a slow mess, since you would have to try all of them for any given set. Good luck, Scott
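The log trick above can be sketched in a few lines of pure Perl. This is only an illustration under assumptions: the sub names (fit_line, fit_exponential, sum_sq_resid) are invented here, every y must be positive for the log to exist, and comparing the two models by their sum of squared residuals is just one simple way to "decide which is better".

```perl
use strict;
use warnings;

# Ordinary least-squares fit of y = slope*x + intercept.
# Takes two array refs of equal length; returns (intercept, slope).
sub fit_line {
    my ($xs, $ys) = @_;
    my $n = @$xs;
    my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        $sx  += $xs->[$i];
        $sy  += $ys->[$i];
        $sxx += $xs->[$i] ** 2;
        $sxy += $xs->[$i] * $ys->[$i];
    }
    my $denom = $n * $sxx - $sx ** 2;
    die "need at least two distinct x values\n" if $denom == 0;
    my $slope     = ($n * $sxy - $sx * $sy) / $denom;
    my $intercept = ($sy - $slope * $sx) / $n;
    return ($intercept, $slope);
}

# Fit y = A*exp(B*x) by fitting a line to (x, log y); returns (A, B).
sub fit_exponential {
    my ($xs, $ys) = @_;
    die "all y values must be positive to take logs\n" if grep { $_ <= 0 } @$ys;
    my @log_y = map { log($_) } @$ys;
    my ($intercept, $slope) = fit_line($xs, \@log_y);
    return (exp($intercept), $slope);     # log(A) is the intercept, B the slope
}

# Sum of squared residuals of a model (a code ref) in the original y units.
sub sum_sq_resid {
    my ($xs, $ys, $model) = @_;
    my $ss = 0;
    $ss += ($ys->[$_] - $model->($xs->[$_])) ** 2 for 0 .. $#$xs;
    return $ss;
}

# Made-up data that is exactly exponential: y = 3 * exp(0.5 * x).
my @x = (0 .. 5);
my @y = map { 3 * exp(0.5 * $_) } @x;
my ($A, $B)   = fit_exponential(\@x, \@y);
my ($c0, $c1) = fit_line(\@x, \@y);
my $ss_exp = sum_sq_resid(\@x, \@y, sub { $A * exp($B * $_[0]) });
my $ss_lin = sum_sq_resid(\@x, \@y, sub { $c0 + $c1 * $_[0] });
printf "A=%.3f B=%.3f; the %s model fits better\n",
    $A, $B, $ss_exp < $ss_lin ? 'exponential' : 'linear';
```

One caveat: fitting in log space minimizes the error in log(y), not in y, so on noisy data the result can differ from a true nonlinear least-squares fit of the exponential.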
Re: Seeking abnormalities in data sets. by toma (Vicar) on Dec 27, 2001 at 09:10 UTC
For your task PDL will be enormously valuable. It will be worth every bit of effort to obtain and learn it. There is the nice PDL::Fit::Linfit module, which does a general curve fit to a linear combination of specified functions. PDL also has functions for selecting the inconsistent data, creating plots, and generating statistical summaries. The PDL module is a Perl extension written in C and FORTRAN. In my experience it is many times faster than the equivalent routines written in pure Perl. It should be quick with 40,000 point datasets. It should work perfectly the first time! - toma
Re: Seeking abnormalities in data sets. by newbie00 (Beadle) on Dec 27, 2001 at 03:36 UTC
Hello. First, in order to properly analyze your data, you must know, within an acceptable level of confidence, that the model you are using is the appropriate one, be it linear, exponential, or other. For example, for the linear model you can use the correlation coefficient to determine whether enough of the variation is explained by that model to give you confidence that the correct model is being used (see a statistics book that covers linear and non-linear regression techniques). Without getting into too much detail, here is a crude method for when you don't have the background to analyze the data to the necessary degree. If you have a 'target' value for each point (e.g. in time, or other) and you don't want to accept data more than, say, +/- 3% away from it, you can calculate a 'band' around the 'target' data. Then you can plot your actual data along with these bands and visually look at the data. You will have 3 curves, drawn point-to-point rather than by fitting a regression, which helps especially if you don't have either the tools or the background necessary to determine the actual regression model each time you collect the 40,000 data points. If an actual data point falls outside of this band, then you may want to look at that particular point a little closer. That does not mean automatically exclude it, unless you have enough info to support excluding it. This method is, again, considered 'crude'. You can import your data into e.g. Microsoft Excel for plotting and/or analysis, using a comma-delimited format that your Perl program can create for you. You can calculate your bands within Excel very easily (preferred, to keep the imported file size to a minimum). This software has statistical routines built in. Plus, there is a book called "Microsoft Excel 2000 Formulas" by John Walkenbach (ISBN 0-7645-4609-0) that may provide you with more info for that software.
Of course, there are other stats books you can use with this software. Be cautious in using crude methods -- what I mean is, don't try to read too much into the results. These types of methods are often used to provide you with a 'direction', not conclusions. Hope some of this helps. Regards. --newbie00
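The band check described above is easy to do in Perl itself, whether or not the result then goes to a spreadsheet. Here is a rough sketch, assuming you already have a target value for every point. The sub name band_rows, the column layout, and the sample numbers are all made up for illustration.

```perl
use strict;
use warnings;

# Build one row per point: index, lower band, target, upper band, actual,
# and a flag that is 1 when the actual value lies outside the band.
# $tolerance is a fraction of the target, e.g. 0.03 for +/- 3%.
sub band_rows {
    my ($targets, $actuals, $tolerance) = @_;
    my @rows;
    for my $i (0 .. $#$targets) {
        my $half  = abs($targets->[$i]) * $tolerance;
        my $lower = $targets->[$i] - $half;
        my $upper = $targets->[$i] + $half;
        my $out   = ($actuals->[$i] < $lower || $actuals->[$i] > $upper) ? 1 : 0;
        push @rows, [$i, $lower, $targets->[$i], $upper, $actuals->[$i], $out];
    }
    return @rows;
}

# Made-up sample data; print comma-delimited rows a spreadsheet can import.
my @target = (100, 200, 300);
my @actual = (101, 207, 290);
print "index,lower,target,upper,actual,outside\n";
print join(',', @$_), "\n" for band_rows(\@target, \@actual, 0.03);
```

A flagged row is only a candidate for a closer look, in the same spirit as the caution above about not excluding points automatically.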
Re: Re: Seeking abnormalities in data sets. by tmiklas (Hermit) on Dec 27, 2001 at 06:42 UTC
Hello! Well, IMHO what newbie00 said -- calculating an acceptable band for the analysed values -- is used rather in situations where you have an exact 'middle' value, which is the standard and required one. BTW, this is used in quality management. If the model is linear this is OK, but otherwise the only way to decide whether everything went OK is to prepare the curve computed by an exact (expected) model, estimate the acceptable difference, and compare those two curves ;) IMHO Excel is only a workaround to visualize data, and AFAIK Excel _will_not_ cooperate so easily with anything other than Excel itself :-( Besides, if you are using Excel, you have to decide yourself whether there are any abnormalities by looking at a graph... Ha! So why do I need a Perl program?! ;-> And if you don't need a graph, why use Excel? Excel is also some solution; everything depends on what you _really_ need. Best regards to everyone here. --tmiklas
Re: Re: Seeking abnormalities in data sets. by scain (Curate) on Dec 27, 2001 at 20:36 UTC
I would caution against using Excel and a correlation coefficient. The correlation coefficient (often reported as its square, r^2) is relatively insensitive to variations in the data. A better measure of goodness of fit is chi squared, which you can calculate when you do a least squares fit. Check out Numerical Recipes chapter 15 for details on doing the calculations. Scott
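As a rough illustration of the chi squared suggestion, here is a small pure-Perl sketch. It assumes you know (or can estimate) a measurement uncertainty sigma for each point, which the thread does not say. The sub name chi_squared, the sample data, and the model y = 2x + 1 are invented for the example; in practice the model would come from your own fit.

```perl
use strict;
use warnings;

# Chi-squared of a model against data: the sum over all points of
# ((y - model(x)) / sigma)^2, where sigma is each point's uncertainty.
sub chi_squared {
    my ($xs, $ys, $sigmas, $model) = @_;
    my $chi2 = 0;
    for my $i (0 .. $#$xs) {
        $chi2 += (($ys->[$i] - $model->($xs->[$i])) / $sigmas->[$i]) ** 2;
    }
    return $chi2;
}

# Made-up data scattered around the assumed model y = 2x + 1.
my @x     = (1, 2, 3, 4);
my @y     = (3.1, 4.9, 7.2, 9.0);
my @sigma = (0.1) x 4;                    # assumed uncertainty per point
my $line  = sub { 2 * $_[0] + 1 };
my $chi2  = chi_squared(\@x, \@y, \@sigma, $line);
my $dof   = @x - 2;                       # points minus fitted parameters
printf "chi2 = %.2f, reduced chi2 = %.2f\n", $chi2, $chi2 / $dof;
```

As a rule of thumb, a reduced chi squared near 1 suggests the model and the assumed uncertainties are consistent with the data, while a much larger value points to a poor model, underestimated uncertainties, or rogue points.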

