PerlMonks  

Seeking abnormalities in data sets.

by ehdonhon (Curate)
on Dec 27, 2001 at 00:43 UTC

ehdonhon has asked for the wisdom of the Perl Monks concerning the following question:

Holiday Greetings Monks!

I have an interesting situation, and I'm hoping that I can find a very intuitive solution by re-using code rather than writing my own hacked up code.

I need to analyze about 40,000 unique sets of data on a daily basis. My job is to take each set of data (made up of many (time, value) pairs) and look for inconsistencies within that data set. The data might be linear or exponential (if exponential, the slope should be always increasing or always decreasing), and the magnitude of the values is irrelevant unless there is a drastic change in magnitude at some point in the data. I analyze each data set separately, so the only relevance of having 40,000 sets to look at is that it cannot be too slow.

I guess what I'm looking for is something that can take a whole bunch of (x,y) pairs and try to fit that data to some sort of line or constantly /(increasing)|(decreasing)/ curve, and then let me know if there were any points that fell outside of the given margin of error.

That probably sounds like a very specific problem, but as I recall from my statistics classes (long, long ago), it comes up quite frequently, so I'm hoping that somebody knows about something that might come close to doing something like this for me.

Thanks in advance!

Replies are listed 'Best First'.
Re: Seeking abnormalities in data sets.
by clintp (Curate) on Dec 27, 2001 at 01:37 UTC
    In Orwant's book, Mastering Algorithms with Perl, at the end of Chapter 15 (Statistics) he talks about finding a "best-fit" straight line (linear least squares, regression line) to a set of data points -- a standard y=bx+a kind of thing from HS Algebra.

    Once you have that for a given range of points, I'd think it would be a small matter to find the rogues (correlation coefficient, r-to-t transformation): use the distance from that line to each point in the set to see if any single point is really whacked out.

    Since I'm not willing to re-type the subroutines here, use the parenthetical terms above in a search engine to find a good algorithm you can transcribe to Perl. Or the MAP examples may be online somewhere at ORA (as they are for the Cookbook and other ORA publications).
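
    A minimal sketch of the idea in pure Perl, not the code from the book: fit the least-squares line, then flag any point whose residual is more than $k residual standard deviations from the line. The names fit_line and outliers, and the threshold $k, are assumptions made up for this illustration. A gross outlier also inflates the spread it is measured against, so a small $k (2 here) may be needed.

```perl
use strict;
use warnings;

# Ordinary least squares for y = intercept + slope * x.
# Takes two array refs of equal length; returns (intercept, slope).
sub fit_line {
    my ($xs, $ys) = @_;
    my $n = @$xs;
    my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        $sx  += $xs->[$i];
        $sy  += $ys->[$i];
        $sxx += $xs->[$i] ** 2;
        $sxy += $xs->[$i] * $ys->[$i];
    }
    my $denom = $n * $sxx - $sx ** 2;
    die "need at least two distinct x values\n" if $denom == 0;
    my $slope     = ($n * $sxy - $sx * $sy) / $denom;
    my $intercept = ($sy - $slope * $sx) / $n;
    return ($intercept, $slope);
}

# Return the indices of points whose residual from the fitted line
# exceeds $k times the residual standard deviation (n - 2 degrees of freedom).
sub outliers {
    my ($xs, $ys, $k) = @_;
    die "need at least three points\n" if @$xs < 3;
    my ($intercept, $slope) = fit_line($xs, $ys);
    my @resid = map { $ys->[$_] - ($intercept + $slope * $xs->[$_]) } 0 .. $#$xs;
    my $ss = 0;
    $ss += $_ ** 2 for @resid;
    my $sd = sqrt($ss / (@resid - 2));
    return () if $sd == 0;    # perfect fit, nothing to flag
    return grep { abs($resid[$_]) > $k * $sd } 0 .. $#resid;
}

# Example: a clean line y = 2x + 1 with one corrupted value.
my @x = (0 .. 9);
my @y = map { 2 * $_ + 1 } @x;
$y[5] = 50;
my @suspects = outliers(\@x, \@y, 2);
print "suspect indices: @suspects\n";
```

    Everything is a single pass over the data, so the cost per data set is linear in the number of points.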

    update: Fixed attribution.

Re: Seeking abnormalities in data sets.
by termix (Beadle) on Dec 27, 2001 at 01:24 UTC

    If I understand your problem correctly, you wish to do curve fitting in Perl using large amounts of data and then detect the exceptions it identifies. Yes, there are a number of statistical methods to accomplish that (which I know very little about).

    • The statistical modules for Perl might help. Check them out here. (or try the Math modules here).
    • I know there is a book that talks about specific curve fitting examples. Ah, here it is. I believe if you know the math behind the subject, then you can create your own algorithm with the help of this book (and contribute to CPAN!).
    • Maybe you don't have to do the curve fitting in Perl. You could use a statistics package for that if there is one that you use and have code/scripts already written for. Perl can be your data parsing and results presentation system and coordinate the work of the curve fitting program.

    -- termix

Re: Seeking abnormalities in data sets.
by scain (Curate) on Dec 27, 2001 at 02:47 UTC
    Do you know that your data sets will always be either (a) linear (i.e., y = mx + b) or (b) exponential (y = A exp(Bx))? If that's the case, then you should be able to use linear least squares as suggested above. Since (a) and (b) are separate cases, you would have to try both and decide on a case-by-case basis which fits better.

    Also, in the case of b, you can convert it to a linear problem by taking the log of y and plotting that against x. (At least that feels right at the moment... log(y) = log(A) + Bx... yeah, that's it).
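
    As a rough sketch of that log trick in pure Perl: fit log(y) against x with ordinary least squares, then recover A = exp(intercept) and B = slope. The helper names (fit_line, fit_exponential, sum_sq_resid, pick_model) are made up for this illustration, and comparing the two models by their sum of squared residuals is just one simple way to decide case by case. The transform assumes every y is strictly positive, and it minimizes the error in log space rather than in the original units.

```perl
use strict;
use warnings;

# Ordinary least squares for y = intercept + slope * x.
sub fit_line {
    my ($xs, $ys) = @_;
    my $n = @$xs;
    my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
    for my $i (0 .. $n - 1) {
        $sx  += $xs->[$i];
        $sy  += $ys->[$i];
        $sxx += $xs->[$i] ** 2;
        $sxy += $xs->[$i] * $ys->[$i];
    }
    my $denom = $n * $sxx - $sx ** 2;
    die "need at least two distinct x values\n" if $denom == 0;
    my $slope     = ($n * $sxy - $sx * $sy) / $denom;
    my $intercept = ($sy - $slope * $sx) / $n;
    return ($intercept, $slope);
}

# Fit y = A * exp(B * x) via log(y) = log(A) + B * x; returns (A, B).
sub fit_exponential {
    my ($xs, $ys) = @_;
    die "log transform needs strictly positive y values\n"
        if grep { $_ <= 0 } @$ys;
    my ($log_a, $rate) = fit_line($xs, [ map { log($_) } @$ys ]);
    return (exp($log_a), $rate);
}

# Sum of squared residuals of the data against a model given as a code ref.
sub sum_sq_resid {
    my ($xs, $ys, $model) = @_;
    my $ss = 0;
    $ss += ($ys->[$_] - $model->($xs->[$_])) ** 2 for 0 .. $#$xs;
    return $ss;
}

# Try both models and report which one leaves the smaller residual sum.
sub pick_model {
    my ($xs, $ys) = @_;
    my ($c, $m)      = fit_line($xs, $ys);
    my ($amp, $rate) = fit_exponential($xs, $ys);
    my $ss_lin = sum_sq_resid($xs, $ys, sub { $c + $m * $_[0] });
    my $ss_exp = sum_sq_resid($xs, $ys, sub { $amp * exp($rate * $_[0]) });
    return $ss_lin <= $ss_exp ? 'linear' : 'exponential';
}

# Example: data generated from y = 3 * exp(0.5 * x).
my @x = (0 .. 5);
my @y = map { 3 * exp(0.5 * $_) } @x;
print "best model: ", pick_model(\@x, \@y), "\n";
```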

    If your data could be of other forms, like higher order polynomials, then you will have to try all options, and it would turn into a slow mess, since you would have to try all of them for any given set.

    Good luck,
    Scott

Re: Seeking abnormalities in data sets.
by toma (Vicar) on Dec 27, 2001 at 09:10 UTC
    For your task PDL will be enormously valuable. It will be worth every bit of effort to obtain and learn it.

    There is the nice PDL::Fit::Linfit module which does a general curve-fit to a linear combination of specified functions.

    PDL also has functions for selecting the inconsistent data, creating plots, and generating statistical summaries.

    The PDL module is a perl extension written in C and FORTRAN. In my experience it is many times faster than the equivalent routines written in pure perl. It should be quick with 40,000 point datasets.

    It should work perfectly the first time! - toma

Re: Seeking abnormalities in data sets.
by newbie00 (Beadle) on Dec 27, 2001 at 03:36 UTC
    Hello.

    First, in order to properly analyze your data, you must know within an acceptable level of confidence, that the model you are using is the appropriate model, be it linear, exponential, or other. For example, you can use the correlation coefficient for e.g. the linear model to determine if enough of the error can be explained by that model to provide you with enough confidence that the correct model is being used (see a statistics book that contains linear and non-linear regression techniques).

    Without getting into too much detail, here is a crude method in case you don't have the background to analyze the data to the necessary degree. If you have a 'target' value for each point (e.g. in time, or other) and you don't want to accept data more than, say, +/- 3% away from it, you can calculate a 'band' around that 'target' data. Then you can plot your actual data along with these bands and visually look at the data. (You will have 3 curves, drawn point-to-point rather than by fitting a regression, which helps if you don't have either the tools or the background necessary to determine the actual regression model each time you collect the 40,000 data points.) If the actual data falls outside of this band, then you may want to look at that particular data point a little closer. That does not mean automatically excluding it, unless you have enough info to support excluding it. This method is, again, considered 'crude'.
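
    The band check can also be done in code instead of by eye. This is a minimal sketch, assuming you already have a target value for every point; the function name outside_band and the way the tolerance is passed (a fraction, so 0.03 for +/- 3%) are choices made for this illustration.

```perl
use strict;
use warnings;

# Return the indices of the points whose actual value lies outside
# target +/- (tolerance * |target|). Tolerance is a fraction, e.g. 0.03.
sub outside_band {
    my ($targets, $actuals, $tolerance) = @_;
    die "targets and actuals must have the same length\n"
        unless @$targets == @$actuals;
    my @flagged;
    for my $i (0 .. $#$targets) {
        my $half_width = abs($targets->[$i]) * $tolerance;
        push @flagged, $i
            if abs($actuals->[$i] - $targets->[$i]) > $half_width;
    }
    return @flagged;
}

# Example: three targets with a 3% band; only the middle reading is out.
my @target = (100, 200,   300);
my @actual = (102, 207, 291.5);
my @flagged = outside_band(\@target, \@actual, 0.03);
print "look closer at indices: @flagged\n";
```

    As with the plot, a flagged index only says "look closer", not "exclude".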

    You can import your data into e.g. Microsoft Excel (using a comma-delimited format for the data, which your Perl program can create for you). You can calculate your bands within Excel very easily (preferred, to keep the imported file size to a minimum) for plotting and/or analysis. This software has statistical routines built in. Plus, there is a book called "Microsoft Excel 2000 Formulas" by John Walkenbach (ISBN 0-7645-4609-0) that may provide you with more info for that software. Of course, there are other stats books you can use with this software.

    Be cautious in using crude methods -- what I mean is, don't try to read too much into the results. These types of methods are many times used to provide you with a 'direction', not conclusions.

    Hope some of this helps.

    Regards. --newbie00

      Hello!

      Well, IMHO what newbie00 said - calculating an acceptable band for the analysed values - is used rather in situations where you have an exact 'middle' value, which is the standard and required one. BTW - this is used in quality management.

      If the model is linear this is ok, but in any other case the only way to decide if everything went ok is to prepare the curve computed by an exact (expected) model, estimate an acceptable difference and compare those two curves ;)

      IMHO Excel is only a workaround to visualize data, but AFAIK Excel _will_not_ cooperate so easily with anything other than Excel itself :-( Besides, if you are using Excel, you have to decide yourself whether there are any abnormalities using a graph... Ha! So why do I need a perl program?! ;-> If you don't need a graph, why do you use Excel? Excel is also some solution; everything depends on what you _really_ need.

      Best regards to everyone here. --tmiklas

      I would caution against using Excel and a correlation coefficient. The correlation coefficient (often referred to as r^2) is relatively insensitive to variations in the data. A better measure of goodness of fit is chi squared, which you can calculate when you do a least squares fit. Check out Numerical Recipes chapter 15 for details on doing the calculations.
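
      For reference, a small sketch of the chi squared calculation in pure Perl. It assumes you have an uncertainty (sigma) for each measured value, which the original question does not mention, and it takes the fitted model as a code ref. The names chi_squared and reduced_chi_squared are made up for this illustration. A reduced chi squared near 1 suggests the model and the stated uncertainties are consistent with the data; a much larger value suggests a poor fit or some rogue points.

```perl
use strict;
use warnings;

# chi^2 = sum over points of ((y_i - model(x_i)) / sigma_i)^2
sub chi_squared {
    my ($xs, $ys, $sigmas, $model) = @_;
    my $chi2 = 0;
    for my $i (0 .. $#$xs) {
        die "sigma must be positive\n" if $sigmas->[$i] <= 0;
        $chi2 += (($ys->[$i] - $model->($xs->[$i])) / $sigmas->[$i]) ** 2;
    }
    return $chi2;
}

# Divide by the degrees of freedom: points minus fitted parameters.
sub reduced_chi_squared {
    my ($chi2, $n_points, $n_params) = @_;
    die "need more points than fitted parameters\n"
        if $n_points <= $n_params;
    return $chi2 / ($n_points - $n_params);
}

# Example: compare four readings against the line y = 2x + 1,
# assuming every reading has an uncertainty of 0.5.
my @x     = (0, 1, 2, 3);
my @y     = (1.0, 3.5, 5.0, 6.5);
my @sigma = (0.5) x 4;
my $chi2  = chi_squared(\@x, \@y, \@sigma, sub { 2 * $_[0] + 1 });
printf "chi^2 = %.3f, reduced = %.3f\n",
    $chi2, reduced_chi_squared($chi2, scalar @x, 2);
```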

      Scott
