I doubt you could do what you want with a system like those used for spam detection. Such systems are typically naive Bayes classifiers. They are "naive" because they assume all variables in the analysis are conditionally independent: when comparing text, a naive Bayes system assumes each word in the text is independent of every other word. This is obviously a completely bogus assumption, but experimental results (and real-world experience) show that text classification doesn't suffer much as a result. This isn't limited to spam, either; naive Bayes classifiers have been trained to categorize posts into newsgroups and the like (see Mitchell, section 6.10). The trouble is that your problem is much harder than text classification, and I'd expect the independence assumption to be far more damaging to your task.
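To make the independence assumption concrete, here's a toy naive Bayes classifier in Perl. The training data is entirely made up, and I've thrown in Laplace smoothing so unseen words don't zero out the product; treat it as a sketch, not production code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # P(class | words) is proportional to P(class) * product of P(word | class)
    # over the words -- the product is where the "naive" independence lives.

    my %count;    # $count{$class}{$word} = occurrences in training data
    my %docs;     # $docs{$class} = number of training documents
    my %vocab;    # set of all words seen

    sub train {
        my ($class, $text) = @_;
        $docs{$class}++;
        for my $word (split /\W+/, lc $text) {
            next unless length $word;
            $count{$class}{$word}++;
            $vocab{$word} = 1;
        }
    }

    sub classify {
        my ($text) = @_;
        my $total = 0;
        $total += $_ for values %docs;
        my ($best, $best_score);
        for my $class (keys %docs) {
            my $words_in_class = 0;
            $words_in_class += $_ for values %{ $count{$class} };
            my $vocab_size = keys %vocab;
            my $score = log($docs{$class} / $total);    # log P(class)
            for my $word (split /\W+/, lc $text) {
                next unless length $word;
                # Laplace smoothing: unseen words get a small non-zero probability
                my $c = $count{$class}{$word} || 0;
                $score += log(($c + 1) / ($words_in_class + $vocab_size));
            }
            ($best, $best_score) = ($class, $score)
                if !defined $best_score || $score > $best_score;
        }
        return $best;
    }

    train('spam', 'buy cheap pills now');
    train('ham',  'meeting notes from tuesday');
    print classify('cheap pills'), "\n";    # prints "spam"

Logs are used instead of multiplying raw probabilities so the product of many small numbers doesn't underflow.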
There are other Bayesian methods, such as optimal Bayes classifiers and Bayesian networks. The Mitchell book noted above gives a nice overview. Note that all these methods are based on assigning probabilities to different hypotheses. You have a continuous hypothesis space, which makes this difficult. You could discretize your hypothesis space by creating a finite set of price ranges, or by attempting to predict whether the price will rise, fall, or stay the same. The other problem is that optimal classifiers and Bayesian networks are costly to train; I'm not even sure there is a polynomial-time algorithm for training Bayesian networks. These techniques are closely related to maximum likelihood estimation, Markov chain Monte Carlo, and other Bayesian techniques commonly used in a variety of fields. This stuff is pretty hard-core and extremely processor-intensive. I work with a political scientist who does Bayesian analysis of the Supreme Court, and his simulations can run for weeks on an openMosix cluster of high-end machines. The simulations are written using a C++ library, and I've spent a great deal of time optimizing it. This is a domain where an interpreted language like Perl simply doesn't shine. Honestly, we'd be better off in terms of speed in C (or better yet, but God forbid, Fortran), but we're trying to strike a balance between efficiency and ease of use in the library.
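For instance, a crude way to do the rise/fall/same discretization mentioned above (the 1% thresholds here are arbitrary placeholders, not a recommendation):

    # Map a continuous price movement onto three discrete classes.
    sub price_class {
        my ($old, $new) = @_;
        my $change = ($new - $old) / $old;    # relative change
        return 'rise' if $change >  0.01;     # up more than 1%
        return 'fall' if $change < -0.01;     # down more than 1%
        return 'same';
    }

    print price_class(100, 103),  "\n";    # "rise"
    print price_class(100, 99.5), "\n";    # "same"

Once the target is discrete like this, the classifier machinery above at least applies in principle.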
You are attempting to tackle a very hard problem. Assuming there are recognizable patterns in your data (the stock market is highly volatile, but market prices of goods are a bit easier to predict), the patterns will likely be highly non-linear as a function of the inputs, and your data will be incredibly noisy. Certain types of neural networks may fit your problem: they handle continuous inputs and outputs well, and recurrent versions can deal with time series. Traditional econometric time-series techniques might work as well, assuming your problem turns out to be at least approximately linear. Essentially, what I'm saying is that choosing a learning/classification technique is going to be your biggest problem. You may have to try a few different techniques and tweak them extensively before you get anything resembling an accurate prediction. Implementation is downright trivial in comparison.
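If you do go the neural network or regression route, the standard trick is to turn the series into (lagged inputs -> next value) training pairs. A minimal Perl sketch, with a made-up series and an arbitrary window of 3:

    # Turn a price series into supervised training examples where the
    # previous $lags prices predict the next one.
    my @prices = (10.0, 10.2, 10.1, 10.4, 10.3, 10.6);
    my $lags   = 3;
    my @examples;
    for my $i ($lags .. $#prices) {
        push @examples, {
            inputs => [ @prices[ $i - $lags .. $i - 1 ] ],
            target => $prices[$i],
        };
    }
    # Each example: { inputs => [p(t-3), p(t-2), p(t-1)], target => p(t) }

The window size, and whether you feed raw prices or differences, are exactly the kind of knobs you'll end up tweaking extensively.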