creating crystal ball 2

First part

Every time I have a while and go back to the problem, I am not able to think very much forward. Maybe there is nothing more to think about? So let's add few words and finish the topic.

In the first part I wrote "All letters 's' at the beginning of word are marked bold. Two matches. Ok, we take first one always.". It definitely has to be changed. What options do we have here?

Penalty for not being used / bonus for being used - mostly used ones in first place
Pointed already by roboticus. Obvious. No question about it. If something has been used already - it is good and will be suggested to be reused. Don't think about pieces of toilet paper here...
longest look-behind on the first place
Obvious, too. If there is look-behind match 9-chars-long, the place is definitely better candidate than the 1-char-long one.
common part from all proposed strings
Finding common part even of few strings can be very difficult and I do not expect this to return any useful result.
common part from some of proposed strings
however, is promising one. How to decide which subset should be chosen? Best 10 strings chosen with help of 1st or 2nd method? Maybe, like in statistic research, take all proposals and remove e.g. 30% edge cases? What are edge cases? (String::Approx and Text::Levenshtein comes to mind, but as the thing is to predict exact string, I don't know if it will be of any use) Or maybe sort alphabetically all strings and return the biggest group starting with the same letter?
closest to the current position
This idea assumes that the paragraph/chapter is about something (e.g. 'unemployment') and similar words are near. See how definition of unemployment looks in wikipedia:
Unemployment is the state in which a person is without work, available to work, and is currently seeking work. The unemployment rate is used in economic studies and economic indexes such as the United States' Conference Board's Index of Leading Indicators.
Check the words: 'Unemployment', 'work', 'economic'.

And Yet Another Idea to consider during implementation: mark with digits 10 best candidates in text with possibility to use via Ctrl+<digit>, Ctrl+<left/right arrow> adds/removes parts of text (words), Enter accepts.

I was reading about LZW, Huffmann, Markov Chains, PPM, PAQ and other compression techniques suggested by you, guys. I don't exactly know how to use it to complement text. These algorithms operate on chars instead of words and this has made me think about following: suggesting one char is of no use. But what about suggesting 2 more chars every time user presses predicted char? Normal text at the beginning is what's already typed, bold is suggested(selection), and the rest light grayed to show what computer "has in mind".

unemployment
unemployment
unemployment
unemployment
unemployment
unemployment

Almost 50% of typing saved... Or better, 2 chars at the beginning, 3 in the next step, etc.

unemployment
unemployment
unemployment
unemployment
unemployment

Well, so much for theory. I plan to start implementing it in a 2-3 weeks as I have now other important things to do, so please be patient :)

Comment on creating crystal ball 2

Replies are listed 'Best First'.
Re: creating crystal ball 2 by blokhead (Monsignor) on May 19, 2008 at 22:04 UTC
I googled for "text prediction" and found soothsayer (homepage) as one of the first hits. Sounds like a programming library that does exactly what you're looking for. Another thing that comes to mind is Dasher, although its novelty is not the text-prediction but its accessible interface. However, its text prediction seems good, it can be trained, and it is fast in real-time. It probably uses some hidden Markov mechanism for prediction. Since it is open source, the prediction mechanism can probably be factored out (if it isn't already) and adapted to your interface needs. Even if you are determined to do things yourself (yikes), you can at least learn a lot by seeing how the others do it. blokhead	[reply]
Re^2: creating crystal ball 2 by grizzley (Chaplain) on May 20, 2008 at 12:19 UTC
Pity, everything is already written... Yeah, of course I'll try this at home. If it will work fine, it may effectively discourage me to write my own piece of Tk code. Trying Dasher online was really interesting, especially connected with speech. Little slow, but interesting.	[reply]