First part
Every time I have a while and go back to the problem, I am not able to think very much forward. Maybe there is nothing more to think about? So let's add few words and finish the topic.
In the first part I wrote "All letters 's' at the beginning of word are marked bold. Two matches. Ok, we take first one always.". It definitely has to be changed. What options do we have here?
- Penalty for not being used / bonus for being used - mostly used ones in first place
Pointed already by roboticus. Obvious. No question about it. If something has been used already - it is good and will be suggested to be reused. Don't think about pieces of toilet paper here...
- longest look-behind on the first place
Obvious, too. If there is look-behind match 9-chars-long, the place is definitely better candidate than the 1-char-long one.
- common part from all proposed strings
Finding common part even of few strings can be very difficult and I do not expect this to return any useful result.
- common part from some of proposed strings
however, is promising one. How to decide which subset should be chosen? Best 10 strings chosen with help of 1st or 2nd method? Maybe, like in statistic research, take all proposals and remove e.g. 30% edge cases? What are edge cases? (String::Approx and Text::Levenshtein comes to mind, but as the thing is to predict exact string, I don't know if it will be of any use) Or maybe sort alphabetically all strings and return the biggest group starting with the same letter?
- closest to the current position
This idea assumes that the paragraph/chapter is about something (e.g. 'unemployment') and similar words are near. See how definition of unemployment looks in wikipedia:
Unemployment is the state in which a person is without work, available to work, and is currently seeking work. The unemployment rate is used in economic studies and economic indexes such as the United States' Conference Board's Index of Leading Indicators.
Check the words: 'Unemployment', 'work', 'economic'.
And Yet Another Idea to consider during implementation: mark with digits 10 best candidates in text with possibility to use via Ctrl+<digit>, Ctrl+<left/right arrow> adds/removes parts of text (words), Enter accepts.
I was reading about LZW, Huffmann, Markov Chains, PPM, PAQ and other compression techniques suggested by you, guys. I don't exactly know how to use it to complement text. These algorithms operate on chars instead of words and this has made me think about following: suggesting one char is of no use. But what about suggesting 2 more chars every time user presses predicted char? Normal text at the beginning is what's already typed, bold is suggested(selection), and the rest light grayed to show what computer "has in mind".
u
nemployment
un
employment
une
mployment
unem
ployment
unemp
loyment
unempl
oyment
Almost 50% of typing saved... Or better, 2 chars at the beginning, 3 in the next step, etc.
u
nemployment
un
employment
une
mployment
unem
ployment
unemp
loyment
Well, so much for theory. I plan to start implementing it in a 2-3 weeks as I have now other important things to do, so please be patient :)