Re: Seeking abnormalities in data sets.

Hello.

First, to analyze your data properly you must know, within an acceptable level of confidence, that the model you are using is the appropriate one, be it linear, exponential, or other. For example, for a linear model you can use the correlation coefficient to check whether the model explains enough of the variation in the data to give you confidence that it is the right one (see a statistics book that covers linear and non-linear regression techniques).
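As a rough illustration of the check described above, here is a minimal sketch in Python (the same arithmetic ports directly to Perl). It fits a straight line by ordinary least squares and returns r^2. The function name `linear_fit_r2` and the sample numbers are made up for the example, and it assumes the x values are not all identical and the y values are not all identical.

```python
def linear_fit_r2(xs, ys):
    """Fit y = intercept + slope * x by least squares and return r^2 as well."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a straight-line fit, r^2 is the fraction of the variation in y
    # that the line accounts for.
    r2 = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r2


# Hypothetical sample data, roughly y = 2x.
intercept, slope, r2 = linear_fit_r2([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(intercept, slope, r2)
```

An r^2 close to 1 only says a line describes most of the variation; it does not by itself prove that a linear model is the right one.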

Without getting into too much detail, here is a crude method for when you don't have the background to analyze the data to the necessary degree. If you have a 'target' value for each point (e.g. in time, or other) and you don't want to accept data that deviates from it by more than, say, +/- 3%, you can calculate a 'band' around the target data. Then plot your actual data along with the band limits and look at the result visually. You will have 3 curves (the actual data plus the upper and lower limits), drawn point-to-point rather than as a fitted regression, which helps if you don't have the tools or background to determine the actual regression model each time you collect the 40,000 data points. If the actual data falls outside the band, you may want to look at that particular data point a little closer. That does not mean you should automatically exclude it, unless you have enough info to support excluding it. Again, this method is considered 'crude'.
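The band check itself is only a few lines. Below is a minimal sketch in Python (easy to port to Perl), assuming one target value per data point and a tolerance given as a fraction of the target. The name `flag_out_of_band` and the sample values are invented for the illustration.

```python
def flag_out_of_band(targets, actuals, tolerance=0.03):
    """Return (index, target, actual) for every point outside the band.

    The band for each point is target +/- tolerance * |target|.
    """
    flagged = []
    for i, (target, actual) in enumerate(zip(targets, actuals)):
        margin = abs(target) * tolerance
        lower = target - margin
        upper = target + margin
        if not (lower <= actual <= upper):
            flagged.append((i, target, actual))
    return flagged


# Hypothetical data: a constant target of 100 with a +/- 3% band.
targets = [100.0, 100.0, 100.0, 100.0]
actuals = [101.0, 96.5, 102.9, 104.0]
for index, target, actual in flag_out_of_band(targets, actuals):
    print("point", index, "target", target, "actual", actual)
```

The flagged points are candidates for a closer look, not for automatic exclusion.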

You can import your data into e.g. Microsoft Excel, using a comma-delimited format that your Perl program can create for you. You can calculate your bands either in the Perl program or, very easily, within Excel (preferred, to keep the imported file size to a minimum) for plotting and/or analysis. This software has statistical routines built in. There is also a book, "Microsoft Excel 2000 Formulas" by John Walkenbach (ISBN 0-7645-4609-0), that may give you more info on that software. Of course, there are other stats books you can use with this software.
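If you do decide to compute the band limits before importing, the comma-delimited output could look like the sketch below (Python here; a Perl version would follow the same shape). The column names and the function name `bands_to_csv` are assumptions made for the example, not anything Excel requires.

```python
import csv
import io


def bands_to_csv(targets, actuals, tolerance=0.03):
    """Return CSV text with one row per point: index, target, lower, upper, actual."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["index", "target", "lower", "upper", "actual"])
    for i, (target, actual) in enumerate(zip(targets, actuals)):
        margin = abs(target) * tolerance
        writer.writerow([i, target, target - margin, target + margin, actual])
    return buffer.getvalue()


# Hypothetical data; write the text to a .csv file and open it in a spreadsheet.
print(bands_to_csv([100.0, 100.0], [101.0, 96.5]))
```

Writing only the index, target, and actual columns and letting the spreadsheet compute the limits keeps the file smaller.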

Be cautious in using crude methods: don't try to read too much into the results. Methods like this are often used to give you a 'direction', not conclusions.

Hope some of this helps.

Regards. --newbie00

Re: Re: Seeking abnormalities in data sets. by tmiklas (Hermit) on Dec 27, 2001 at 06:42 UTC
Hello! Well, IMHO what newbie00 described (calculating an acceptable band for the analysed values) is used rather in situations where you have an exact 'middle' value, which is the standard and required one. BTW, this is used in quality management. If the model is linear this is OK, but otherwise the only way to decide whether everything went OK is to prepare the curve computed from the exact (expected) model, estimate an acceptable difference, and compare the two curves ;) IMHO Excel is only a workaround to visualize data, and AFAIK Excel _will_not_ cooperate so easily with anything other than Excel itself :-( Besides, if you are using Excel, you have to decide for yourself from a graph whether there are any abnormalities... Ha! So why do I need a Perl program?! ;-> And if you don't need a graph, why use Excel? Excel is also a solution of sorts; everything depends on what you _really_ need. Best regards to everyone here. --tmiklas
Re: Re: Seeking abnormalities in data sets. by scain (Curate) on Dec 27, 2001 at 20:36 UTC
I would caution against using Excel and a correlation coefficient. The correlation coefficient (often reported in squared form as r^2) is relatively insensitive to variations in the data. A better measure of goodness of fit is chi squared, which you can calculate when you do a least squares fit. Check out Numerical Recipes chapter 15 for details on doing the calculations. Scott
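For anyone who wants to see the chi squared calculation spelled out, here is a minimal sketch in Python (it ports easily to Perl). It assumes you already have a fitted model and an estimate of the measurement error (sigma) for each point. The function names and sample numbers are made up for the illustration.

```python
def chi_squared(xs, ys, sigmas, model):
    """Sum of squared residuals, each scaled by that point's measurement error."""
    return sum(((y - model(x)) / s) ** 2 for x, y, s in zip(xs, ys, sigmas))


def reduced_chi_squared(xs, ys, sigmas, model, n_params):
    """Chi squared divided by the degrees of freedom (points minus fitted parameters)."""
    return chi_squared(xs, ys, sigmas, model) / (len(xs) - n_params)


# Hypothetical example: the model y = 2x with a sigma of 0.5 on every point.
def line(x):
    return 2.0 * x


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.5, 5.0]
sigmas = [0.5, 0.5, 0.5]
print(chi_squared(xs, ys, sigmas, line))
print(reduced_chi_squared(xs, ys, sigmas, line, n_params=1))
```

As a rule of thumb, a reduced chi squared near 1 suggests the model and the error estimates are consistent with the data, while a much larger value points to a poor model or underestimated errors.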