Re: Finding Patterns
by BrowserUk (Patriarch) on Sep 23, 2004 at 01:32 UTC
The problem with your question as asked is that there is no way to instruct a computer to "search data and find information". You have to tell it what the data looks like, and what would constitute a pattern.
For example: you could inspect a file containing text and conclude that it contains a series of numbers between 1 and 255. Or between 0 and 65535. Or 0 and ...; or that it contains letters, numbers, and spaces; or words; or sentences.
The point is that at every level, you, the programmer, have already applied some knowledge to the algorithm. But even then, all you have derived from your data is information--the number of characters, their frequencies, etc. It still requires you, the user of the program, to apply further heuristics to achieve your goal and derive knowledge.
About the best you can do when it comes to deriving (unforeseen) knowledge from data is to make some guesses about what might be hidden in the data, and use a program to process and display that data in such a way that the hidden knowledge becomes easily apparent.
An example of this is testing the output of PRNGs (pseudo-random number generators) for patterns and correlations. One way is a method called spectral testing. (See Re: Testing for randomness (spectral testing) for a crude example of this.) This technique can be applied to many types of data.
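To make the idea concrete, here is a minimal sketch of a 2-D spectral test in Python. The generator constants and grid size are illustrative choices, not anything from the original post; the generator shown is RANDU, a famously flawed LCG whose successive outputs fall on visible planes:

```python
# A crude 2-D spectral test: plot successive outputs of a generator as
# (x, y) pairs on a coarse grid and look for clumping in the counts.

def lcg(seed, a=65539, m=2**31):
    """RANDU, a notoriously bad linear congruential generator."""
    x = seed
    while True:
        x = (a * x) % m
        yield x / m  # scale into [0, 1)

def spectral_grid(gen, n_pairs, size=16):
    """Count successive (x, y) pairs into a size x size grid."""
    grid = [[0] * size for _ in range(size)]
    for _ in range(n_pairs):
        x, y = next(gen), next(gen)
        grid[int(y * size)][int(x * size)] += 1
    return grid

if __name__ == "__main__":
    grid = spectral_grid(lcg(12345), 10_000)
    # Render counts as ASCII intensity; uneven "colour" suggests correlation.
    for row in grid:
        print("".join(" .:-=+*#%@"[min(c * 10 // 80, 9)] for c in row))
```

A good generator fills the grid roughly evenly; a bad one shows stripes or clumps, which is exactly the kind of visual cue described above.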
In your example of analysing a player's chess strategy, you had a tentative idea that it might be possible to detect whether a player tended to castle as early as possible in a game. You might test this theory by plotting his games on a spectral grid and looking for "clumping". To do this, you would need to derive 2 numerical values (or 3, but let's start simple :).
For each game, you derive your set of pairs of numbers and plot them onto the grid using one as the X-coordinate and the other as the Y. You then increase the numeric value of the colour plotted at the corresponding point. As you plot more and more games, the intensity of the colour at various points on the plot will increase. If the whole plot tends to increase evenly, you have essentially proved that there is no correlation between the two variables you have plotted. If the colours tend to clump at various points on the overall plot, then you may have discovered some level of correlation. It's then up to you to decide what, if any, significance that correlation has.
You might, for instance, plot the move number along one axis, and the type of move made along the other. The problem then becomes how to categorise the moves (into an integer). You might choose to categorise castling as 1; a non-attacking move (a move that does not place an opponent's piece under attack) as 2; an attacking move as 3; a trade of pieces accepted as 4; a trade of pieces declined as 5; and so on. This might lead to the conclusion that the player prefers to start out with a land-grab before going on the attack. Or that they like to trade pieces whenever possible; etc.
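The move-number vs. move-type plot above can be sketched as follows. The category codes match the ones suggested in the text, but the sample "games" are made-up lists of codes purely for illustration; a real analysis would derive them from parsed game records:

```python
# Accumulate a (move number, move category) grid across many games.
from collections import Counter

CASTLE, QUIET, ATTACK, TRADE_ACCEPTED, TRADE_DECLINED = 1, 2, 3, 4, 5

def accumulate(games, grid=None):
    """grid[(move_no, category)] += 1 for every move in every game."""
    grid = Counter() if grid is None else grid
    for game in games:
        for move_no, category in enumerate(game, start=1):
            grid[(move_no, category)] += 1
    return grid

# Two toy "games", each a list of category codes, one per move:
games = [
    [QUIET, QUIET, CASTLE, ATTACK, TRADE_ACCEPTED],
    [QUIET, CASTLE, QUIET, ATTACK, TRADE_DECLINED],
]
grid = accumulate(games)
# A hot spot at low move numbers in the CASTLE column would suggest
# the player tends to castle early.
print(grid[(3, CASTLE)], grid[(2, CASTLE)])  # prints: 1 1
```

With thousands of games accumulated this way, the counts play the role of the colour intensities in the spectral plot.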
So, there are three parts to the problem:
1. Deciding what to measure: reducing features of the data to numerical values.
2. Processing and plotting those values so that any patterns become visible.
3. Interpreting what, if anything, a detected pattern or correlation means.
The last part is by far the hardest. For example, you can analyse a writer's output by plotting word pairs and word triples. With enough input, the spectral patterns can become indicative of authorship; but in and of themselves, such plots tell you very little about the author.
There is still too much background noise in the data to be able to tell anything about their education or origins, etc. For example, there are certain pairs and triples of words that crop up with a high frequency regardless of the writer. To remove this noise from the plot, it becomes necessary to "normalise" the plot. One way of doing this is to start out with a plot of a mid-range colour (for 8-bit colour, you start by setting every pixel to 128). You then process a whole range of inputs from many authors and types of text, but instead of incrementing each pixel, you decrement it. This becomes your normalised background.
You can then use this on a batch of samples of your author's work, this time incrementing instead of decrementing. Where common pairs and triples occur in the author's work, they will tend to return the appropriate pixels back to the mid-range colour. Where he uses unusual pairings of words, the corresponding pixels will tend to be lifted above the mid-range. By applying a high-pass filter to the results, you will tend to isolate the author's particularities from the background. You may then have the basis for further linguistic analysis.
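The decrement-then-increment scheme can be sketched like this. The word pairs stand in for pixels, the function names and threshold are illustrative inventions, and the texts here would need to be large corpora for the result to mean anything:

```python
# Normalise word-pair counts: start every "pixel" at mid-range 128,
# decrement for pairs seen in a broad background corpus, increment for
# pairs in the target author's samples, then keep only pairs lifted
# well above mid-range (a crude high-pass filter).
from collections import defaultdict

def word_pairs(text):
    words = text.lower().split()
    return list(zip(words, words[1:]))

def analyse(corpus_texts, author_texts, mid=128, threshold=8):
    plot = defaultdict(lambda: mid)
    for text in corpus_texts:       # background: decrement
        for pair in word_pairs(text):
            plot[pair] -= 1
    for text in author_texts:       # author samples: increment
        for pair in word_pairs(text):
            plot[pair] += 1
    # "high-pass filter": keep only pairs lifted above the mid-range
    return {p: v for p, v in plot.items() if v - mid >= threshold}
```

Pairs every writer uses sink toward (or back to) 128 and are filtered out; pairs peculiar to the author rise above it and survive the filter.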
With the right equipment and/or software, this type of analysis can 'discover' correlations where they would otherwise be very hard to detect. The nice thing about it is that it can be relatively trivial (and fast) to perform an analysis of many different pairings of variables and just see what shows up. The downside is that once you think you have detected a correlation, it then falls to human ingenuity to try to apply some reasoning to it :)