in reply to creating crystal ball

grizzley:

Sorry ... I couldn't wait for you to post the next chunk. It's an interesting question, and while reading it, I couldn't help thinking about LZW compression. I'm not sure if you're familiar with it or not, but I think you could adapt it to your problem, especially considering:

What is the goal? Having text of n characters I would like to type only k letters, where k aims to 0. Without any dictionary at the beginning.
...snip...
1. Word autocompletion. Application stores every typed word in some struct and lists matching words while I start typing another one.
...snip...
2. Context search. There are regexps with negative, positive assertions. Hey, why don't do contextual search and try to guess next word without typing any char?

Using an LZW-like table would allow you to construct your dictionary as you use it. So as you work on a project, it would continue to improve its autocompletion guesses. Since you're not worrying about compression/decompression, you can easily edit the table. If you add a timestamp to the nodes, you could even prefer the most-recently-used words in preference to the most common. (Thus, the variable '$food' in your current project could be preferred to '$foobar' that you used in your example programs yesterday.)

As far as regexps are concerned, if you use the LZW-table idea, you wouldn't use regexes at all. You'd instead build a state machine where the last "n" keystrokes would be encoded into the current state, and you could examine the edge transitions from the current state and propose words/phrases based on usage.

As far as how much to autocomplete: I'm thinking that with the LZW table, you'd default to autocomplete all the "common" edges until you hit a steep drop in probability. Then you could use gestures to alter the amount to autocomplete. Something like: space would autocomplete the current word, tab would autocomplete the current "best guess". (If you display more than one at a time, you could use the down arrow to discard the current best guess...) An shifted character could accept up to and including that character in the current best guess, and a normal unshifted character could be entered normally (causing your table to update probabilities, etc.).

Regarding resource usage: Memory is cheap, so you might choose a (reasonably large) number of edges as a target, and if your table reaches that size, you might cull 5-10% of the edges--those with the lowest probability.

Other ideas that come to mind:

Hmmm.... now I've got it percolating in the back of my head. If you decide to start implementing it with this table-based idea, let me know, and I'll jot down my ideas for the data structure and such. Be sure to keep us apprised of this project, it sounds fun!

Can I put just all possible variations ('it', 'it is', 'it is the', 'it is the string') to the cleverly constructed hash/array? Let's hope it won't slow down the thing awfully and won't eat all my RAM... But will it be really useful? Normally in the texts the same parts of text are not so long - few words at the most. Is there really a reason to look for 'let me finish this task for you and in this time you can do us both a tee'? I don't think so.

One thing I like about the node/edge table-driven approach is that since you're culling low-probability edges, high probability strings can build to be arbitrarily long. So if you were tasked with typing your example sentence a hundred times as punishment, the sentence would quickly become the next predicted text.

And of course is it acceptable if the application takes 2 minutes to process all the text only to show 20 letters long super clever autocompletion? Definitely not, in 2 minutes I can write many letters, a lot more than 20. This must work fast to be helpful.

Yep ... that's the sticking point. I'm a touch-typist, and long delays would irritate me to death. But with computers getting faster and faster, you can throw more power at it each year. It's an interesting problem ... I'm looking forward to seeing what you come up with.

...roboticus

Replies are listed 'Best First'.
Re^2: creating crystal ball
by salva (Canon) on Apr 21, 2008 at 13:59 UTC
    ... and while reading it, I couldn't help thinking about LZW compression. I'm not sure if you're familiar with it or not, but I think you could adapt it to your problem

    Funnily, the original post was making me think about Prediction by Partial Matching compression!

Re^2: creating crystal ball
by grizzley (Chaplain) on Apr 22, 2008 at 11:25 UTC
    Sorry ... I couldn't wait for you to post the next chunk. It's an interesting question...

    Good, that you didn't wait. I plan to enter next chunk in a few days rather than in a few hours. You have almost written that for me, because in a natural way thinking about the problem goes in direction 'probabilities', 'mostly used', etc. Let me answer this part in another meditation.

    ...and while reading it, I couldn't help thinking about LZW compression. I'm not sure if you're familiar with it or not...

    Yes, I know about compression algorithms, maybe not in detail, but overall technology. It will help storing those all possibilities, no doubt. Made me think about another approach: offer common 2-, 3-letter long hints, e.g. 'thr', 'in', 'st'. But it won't work. Either I'll end learning to stenotype (and learning new alphabet/set of symbols) or spending time accepting offered hints. I mean, what is quicker: to choose from 'is', 'if', 'in' or just to write letter 'i' and letter 's'?

    Something like: space would autocomplete the current word, tab would autocomplete the current "best guess".

    Space is not a good thing to accept hint while typing text full of spaces. Rather Tab and Enter. But it's detail and matter of personal preference. One can make shortcuts configurable.

    Regarding resource usage...

    I think it is not yet time for bothering about resources. You are right, memory is cheap, CPUs fast. A propos, I 'discovered' what efficiency is needed for final solution: it depends on user's typing speed. If user types 10 CPM(chars per minute), algorithm should run 10 times per second, for me it will be ca. 40 CPM, if there is any 100CPM magician willing to use it, it would be nice to have it working with frequency 100/s. Tough constraint...

    If you decide to start implementing it with this table-based idea, let me know, and I'll jot down my ideas for the data structure and such. Be sure to keep us apprised of this project, it sounds fun!

    You can be sure I won't keep it hidden :) I want to split the fun to following parts:

    • limit the problem, so that it fits in my head :) Much fun. It's almost finished.
    • implement some solution(s). More fun.
    • optimize it for speed! So much fun! Is anyone here who doesn't like this step? :)
    • golf it! Don't have to mention how much fun it is :)
    ('Project'? He called this a 'project'... I though I'll end up with one file, maybe 50 lines long?...)

    P.S. How do you, people, do it?!? I mean, writing nodes so quick! It takes time to preview and format correctly and you give almost immediate (and so long!) response! Admit, you had this answer prepared for a few weeks and just waited for proper question... :)

      grizzley:
      P.S. How do you, people, do it?!? I mean, writing nodes so quick! It takes time to preview and format correctly and you give almost immediate (and so long!) response! Admit, you had this answer prepared for a few weeks and just waited for proper question... :)

      Well ... you almost answered it yourself earlier...

      if there is any 100CPM magician willing to use it, it would be nice to have it working with frequency 100/s. Tough constraint...

      I first touched a computer (terminal) in high school, and immediately realized that I'd be using them for the rest of my life......so I took typing classes the very next semester. The large number of pretty girls in the class had no bearing on my decision! ;^) It was perhaps the most useful course I took.

      One thing about being a relatively fast typist, though...I can't *stand* it when an IDE tries to help me type or predict what I'm going to do next. There's a "flow-state" you're in when you touch-type, and making decisions on choosing text fragments keeps pulling you out of the flow, so it ultimately slows me down. So even if you're successful in this project, I still won't be able to use it... ;^P

      ...roboticus
        I tried to search google for some stats on how fast people type. I found this typing test and it seems that for simple text without '[]{}$!&%^#' even I am much faster than 100CPM. So the constraint becomes at least 300/s.
Re^2: creating crystal ball
by ambrus (Abbot) on Apr 23, 2008 at 07:57 UTC

    Your point about processing speed is right. However, even if your computer is very fast, you have to take into account the speed the user can think. If the autocompletion is not predictable, and the user keeps having to watch the screen for whether the completion is right, you lose lots of speed.

    Also, compression reminds me to that christmas carol.

      You are right when one person reads text from a piece of paper and writes. It will help only in case: "Hey, let's split the work: read it loud, and I'll write. This way we will finish faster." (like in the joke: why policemen always go two by two? Because always one of them can read, and the other one can write)