Challenge: Predictive Texting

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Challenge: Predictive Texting by Not_a_Number (Prior) on Jan 10, 2007 at 19:26 UTC
Right, I think I understand the rules now. :-) A subsidiary question: are we allowed to sort our datastructure by frequency of paragrams (or 'textonyms' as I learn that they are also called from the wikipedia page you link to)? If so, does anybody know of a freely available list of word frequencies in US English? (A good UK English resource is this site, which uses the British National Corpus). Come to think of it, the answer to this probably depends on yet another subsidiary question that I have: will the mystery text* consist of (a) more or less 'normal' English prose (albeit with punctation and capitalisation removed) or (b) a more or less random string of words (in which case frequency considerations will be otiose)? Looking through 2of12.txt, I see that it is extremely poor in inflected forms (plurals, past tenses...) - even 'lips', which you use in ~~several~~ a couple of your examples above, is not included - which means that it would be pretty difficult to construct a coherent text of any length consisting of words only to be found in the list. * Note that 2of12.txt contains few or no UK English variant spellings (no 'colour', 'criticise', 'manoeuvre'...). BTW, how should we parse 'between 3 and 5 thousand': 'between 3 and 5000', or 'between 3000 and 5000'? </nitpick> Update:** PS, I forgot to add this. Thanks once again for an interesting, thought-provoking challenge. Limbic~Region++!	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 19:55 UTC
Not_a_Number, ...are we allowed to sort our datastructure by frequency of paragrams.. Yes. In fact, the reason the mystery text remains secret is so this technique is not applied to just that text skewing the results. If so, does anybody know of a freely available list of word frequencies in US English? I am fairly certain I came across one this morning when researching but can't be sure that it was US English. will the mystery text** consist of (a) more or less 'normal' English prose (albeit with punctation and capitalisation removed) or (b) a more or less random string of words (in which case frequency considerations will be otiose)? More or less US English prose. ... - which means that it would be pretty difficult to construct a coherent text of any length consisting of words only to be found in the list. You are quite correct. The 2of12inf.txt does a much better job in this area. On the other hand, if an entire book can be written without using the letter e in two different languages, I am sure that it will not be too difficult to provide mystery text between 3000 and 5000 words that meet the constraints. Thanks once again for an interesting, thought-provoking challenge. You're welcome. Cheers - L~R	[reply]
Re^3: Challenge: Predictive Texting by rhesa (Vicar) on Jan 11, 2007 at 15:01 UTC
On the other hand, if an entire book can be written without using the letter e in two different languages, (...) That would be the famous book by Georges Perec: A_Void (originally "La disparition").	[reply]
Re^4: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 11, 2007 at 15:51 UTC
Re: Challenge: Predictive Texting by ikegami (Patriarch) on Jan 10, 2007 at 18:01 UTC
All entries must support the same API to be considered. So what's the API? Specifically: Digits will be passed in one at a time. To what are the digits passed? To a program? To a process? To a function? How are the digits passed? By argument? As a character on STDIN? As a line on STDIN? The expected return value will be... How is the value returned? Function return? STDOUT with newline? Update: Oops, I see that you've already partially answered this, and the rest can be deduced from that answer. (function/method, function argument, function return)	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 18:12 UTC
ikegami, It never ceases to amaze me how much I assume. I regularly post challenges and interesting problems and inevitably there is confusion despite my best efforts at being clear and succinct. In any case, I hope things make sense now. Cheers - L~R	[reply]
Re: Challenge: Predictive Texting by BrowserUk (Patriarch) on Jan 10, 2007 at 18:22 UTC
The problem text to solve is secret It will contain between 3 and 5 thousand words No punctuation except a "space" 3000 to 5000 words with no indication of end/start of sentence is one hell of a sentence. Predicting the next word could be much more accurate if you know it is the first word of a sentence or the second etc. ? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 18:58 UTC
BrowserUk, I agree and think that would be a great improvement to current predictive texting techniques - at least the ones I have seen. That is a layer of complexity I do not want to add into the challenge for the first round but perhaps will follow it up. Cheers - L~R	[reply]
Re: Challenge: Predictive Texting by GrandFather (Saint) on Jan 10, 2007 at 23:34 UTC
A quick (in the sense of not much time spent on it) first cut version before I head off on holiday. :) use strict; use warnings; use constant LEARN => 1; # Set true to learn by doing open WORDS, '<', '2of12.txt'; my @words = <WORDS>; close WORDS; chomp @words; @words = map {[$_, asSMS ($_)]} @words; my $lookup = bless {lookup => {}, nextWordNIndex => 0}; $lookup->addWord ($_) for @words; $lookup->training (); $lookup->{scan} = $lookup->{lookup}; my %dict = map {($_->[0] => $_->[1])} @words; my $strokes = 0; my $characters = 0; my $txtWords = 0; while (<DATA>) { chomp; my @dataWords = split; for my $word (@dataWords) { $word =~ tr/a-z//cd; next if ! exists $dict{$word}; ++$txtWords; my @digits = @{asSMS ($word)}; my @pressed; $characters += @digits; while (1) { my $digit = (shift @digits) \|\| '-'; my $result = $lookup->addDigit ($digit); ++$strokes; push @pressed, $digit; if ($result eq $word) { $lookup->addDigit ('+'); ++$characters; ++$strokes; push @pressed, '+'; last; } } print "@pressed ($word)\n"; } } print "$strokes strokes to send a $characters character message contai +ning $txtWords words.\n"; sub addWord { my ($self, $word) = @_; my @digits = @{$word->[1]}; my $text = $word->[0]; my $scan = $self->{lookup}; for my $digit (@digits) { if (defined $scan->{$digit}) { push @{$scan->{$digit}{words}}, [$text, 0]; } else { $scan->{$digit} = {words => [[$text, 0]]}; } $scan = $scan->{$digit}; } } sub addDigit { my ($self, $digit) = @_; if ($digit eq '+') { $self->train ($self->{lastGuess}[0], 1) if LEARN; $self->doneWord (); return $self->{lastGuess}[0]; } elsif ($digit eq '-') { if (! exists $self->{scan}{wordsN}) { @{$self->{scan}{wordsN}} = grep {$self->{wordlen} == length $_->[0]} @{$self->{sc +an}{words}}; } die "Unknown word @{$self->{digits}}" if $self->{nextWordNIndex} == @{$self->{scan}{wordsN}}; # Skip words that have been tried already my $wordsN = $self->{scan}{wordsN}; ++$self->{nextWordNIndex} while exists $self->{tries}{$wordsN->[$self->{nextWordNInd +ex}][0]}; $self->{lastGuess} = $wordsN->[$self->{nextWordNIndex}] } else { die "Unknown word @{$self->{digits}} $digit" if ! exists $self->{scan}{$digit}; push @{$self->{digits}}, $digit; $self->{scan} = $self->{scan}{$digit}; ++$self->{wordlen}; my $index = 0; my $words = $self->{scan}{words}; # Skip words that have been tried already ++$index while exists $self->{tries}{$words->[$index][0]}; $self->{lastGuess} = $words->[$index] } $self->{tries}{$self->{lastGuess}[0]} = 1; return $self->{lastGuess}[0]; } sub train { my ($self, $word, $weight) = @_; my @digits = @{asSMS ($word)}; my $wordlen = length $word; $weight \|\|= 1; $weight += $wordlen / 10.0; # Sort the word lists by frequency my $scan = $self->{lookup}; for my $digit (@digits) { $scan = $scan->{$digit}; my @match = grep {$_->[0] eq $word} @{$scan->{words}}; last if ! @match; $_->[1] += $weight for @match; @{$scan->{words}} = sort {$b->[1] <=> $a->[1]} @{$scan->{words +}}; } # Sort the wordN lists by frequency $scan = $self->{lookup}; $scan = $scan->{$_} for @digits; if (! exists $scan->{wordsN}) { @{$scan->{wordsN}} = grep {$wordlen == length $_->[0]} @{$scan->{words}}; } my @match = grep {$_->[0] eq $word} @{$scan->{wordsN}}; $_->[1] += $weight for @match; @{$scan->{wordsN}} = sort {$b->[1] <=> $a->[1]} @{$scan->{wordsN}} +; } sub doneWord { my $self = shift; $self->{scan} = $self->{lookup}; $self->{digits} = (); $self->{wordlen} = 0; $self->{nextWordNIndex} = 0; $self->{tries} = (); } sub asSMS { my $word = shift; my @SMSWord; $word =~ tr /a-z//dc; for (split //, $word) { if (/[abc]/) { push @SMSWord,'2'; } elsif (/[def]/) { push @SMSWord,'3'; } elsif (/[ghi]/) { push @SMSWord,'4'; } elsif (/[jkl]/) { push @SMSWord,'5'; } elsif (/[mno]/) { push @SMSWord,'6'; } elsif (/[pqrs]/) { push @SMSWord,'7'; } elsif (/[tuv]/) { push @SMSWord,'8'; } elsif (/[wxyz]/) { push @SMSWord,'9'; } } return \@SMSWord; } sub training { my $self = shift; my %freqs = ( i => 1, an => 1, no => 1, so => 1, by => 1, at => 1, has => 1, far => 1, see => 1, key => 1, who => 1, but => 1, are => 1, each => 1, used => 1, with => 1, here => 1, only => 1, same => 1, been => 1, pure => 1, below => 1, might => 1, could => 1, value => 1, enter => 1, solve => 1, rules => 1, minus => 1, space => 1, pearl => 1, number => 1, winner => 1, fewest => 1, reduce => 1, secret => 1, amount => 1, chosen => 1, except => 1, resume => 1, return => 1, texting => 1, presses => 1, thought => 1, correct => 1, contain => 1, support => 1, fastest => 1, problem => 1, between => 1, present => 1, category => 1, thousand => 1, returned => 1, provided => 1, expected => 1, suggested => 1, incorrect => 1, lowercase => 1, technique => 1, challenge => 1, interface => 1, following => 1, algorithm => 1, submitted => 1, solutions => 1, implement => 1, suggestion => 1, dictionary => 1, authorized => 1, completion => 1, considered => 1, keystrokes => 1, predictive => 1, programming => 1, interesting => 1, punctuation => 1, information => 1, application => 1, if => 2, one => 2, and => 2, for => 2, that => 2, time => 2, text => 2, plus => 2, best => 2, digit => 2, being => 2, means => 2, digits => 2, entries => 2, instead => 2, spelled => 2, finished => 2, must => 3, words => 3, passed => 3, of => 4, it => 4, in => 4, word => 4, to => 5, all => 5, is => 7, will => 7, a => 8, be => 9, the => 12, ); if (! LEARN) { # Cheat by knowing the word frequencey up front $self->train ($_, $freqs{$_}) for keys %freqs; } } __DATA__ all predictive texting is a technique to reduce the number of key presses +used to enter text with i thought it might be an interesting challenge to see +who could implement the best algorithm here are the rules the problem text to solve is secret it will contain between and thousand words all words will be present in only authorized dictionary all words will be lowercase no punctuation except a space solutions must be pure pearl all entries must be submitted by one winner for each category below will be chosen fewest keystrokes fastest completion time all entries must support the same application programming interface to + be considered digits will be passed in one at a time the expected return +value will be the best suggestion for the amount of information provided so far i +f instead of a digit a plus is passed in it means that the word is finished bein +g spelled and the correct word has been returned if instead of a digit a minus i +s passed in it means that the word is finished being spelled but the suggested +word is incorrect digits will resume following a plus [download] Using an early version of the OP's text as the message text I get the following result: `844 strokes to send a 901 character message containing 170 words.` [download] Note that the 2of12.txt file is expected to reside in the current directory. Updated to count '+' as strokes and part of message text - they effectively turn into spaces Updated to take advantage of "knowing" the word frequency in the test text up front. Strokes reduce to 540! This version prints the strokes used for each word. Set LEARN true to have the code learn the word frequency as the text is supplied. In this case the stroke count increases to 810 (there's a slight algorithm improvement on the first version too;) ). DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re: Challenge: Predictive Texting by demerphq (Chancellor) on Jan 11, 2007 at 09:33 UTC
I hate predictive texting and T9. I would much prefer that the phone companies release a phone with the letters properly huffman coded onto the buttons. The fact that I have to press a button twice for some of the most common letters in common English usage annoys me no end. WTF are 'P' and 'W' doing as a single press letters? Even something where I got to select morphemes based on huffman coding would be much nicer. 'th' 'er', 'ere', 'ing' would go a long way to make things easier, and would be much more consistant and use less phone resources than this crappy AI related stuff that will never work properly. People can learn things like this pretty easy (look how fast kids get at texting given the absolute crap designs out there.) --- $world=~s/war/peace/g	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 11, 2007 at 13:07 UTC
demerphq, I am not a big fan of predictive texting either. Actually, I don't even own a cell phone. I didn't let that deter me from an interesting challenge though. It looks like one follow-on challenge then will be to propose a new keypad layout to see how well your alternative(s) match up. Cheers - L~R	[reply]
Re: Challenge: Predictive Texting by perrin (Chancellor) on Jan 10, 2007 at 15:16 UTC
Isn't it pretty much the same problem that all of the JavaScript autocomplete backend scripts solve? There are a few of them on CPAN.	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 16:05 UTC
perrin, Isn't it pretty much the same problem that all of the JavaScript autocomplete backend scripts solve? Not exactly, auto-completion is a technique to predict what you are going to type in the future based off what you have already typed. I did a bit of research this morning before posting and there are a number of papers out there regarding predictive texting - so yes, it is a solved problem. That shouldn't deter you from the fun of the challenge. Cheers - L~R	[reply]
Re: Challenge: Predictive Texting by BrowserUk (Patriarch) on Jan 10, 2007 at 16:39 UTC
If instead of a digit, a '-' is passed in, it means that the word is finished being spelled but the "suggested" word is incorrect. Digits will resume following a '+'. Does that mean that after a '-' is passed in, you'll pass a '+' before resuming passing digits? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 17:22 UTC
BrowserUk, The AM is correct. If for instance: `<previous input> Input: 7 Output: kiss Input: - Output: lips Input: + Output: lips <repeated> Input: <digit>` [download] Cheers - L~R	[reply] [d/l]
Re^3: Challenge: Predictive Texting by Not_a_Number (Prior) on Jan 10, 2007 at 17:37 UTC
Sorry, but I'm still not clear what 'the word is finished being spelled' means. Could it be paraphrased, for example, by 'the suggested word is the same length as the intended word?	[reply]
Re^4: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 17:51 UTC
Re^5: Challenge: Predictive Texting by ambrus (Abbot) on Jan 11, 2007 at 12:16 UTC
Some notes below your chosen depth have not been shown here
Re^2: Challenge: Predictive Texting by Anonymous Monk on Jan 10, 2007 at 16:46 UTC
Each time you pass '-', the program returns another guess, until '-','+', when more letters/digits are added.	[reply]
Re: Challenge: Predictive Texting by GrandFather (Saint) on Jan 10, 2007 at 19:57 UTC
I notice that some of the words in the 2of12inf.txt file have trailing % characters. What are they for? DWIM is Perl's answer to Gödel	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 20:04 UTC
GrandFather, From the readme - Plurals of "uncountable" nouns were included, annotated with the "%" suffix character. See below for an extended discussion of the inclusion of these words. Keep in mind that 2of12.txt ne 2of12inf.txt though I am thinking for add-ons to the challenge, UK english as well as inflections will add to the complexity though the techniques should remain relatively the same. Cheers - L~R	[reply]
Re: Challenge: Predictive Texting by Anonymous Monk on Jan 10, 2007 at 15:31 UTC
Which file contains 2of12.txt?	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 16:12 UTC
Anonymous Monk, Great question, it is inside this zip file. Cheers - L~R	[reply]
Re: Challenge: Predictive Texting by halley (Prior) on Jan 10, 2007 at 19:06 UTC
Isn't T9 a commercially available form of this, on a word-by-word basis? Are you trying to find something that is faster than T9, or just an open implementation of a T9-alike? -- `[ e d @ h a l l e y . c c ]`	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 19:12 UTC
halley, My wife kept me awake last night texting her friends and family. Since I couldn't sleep, I began wondering about the algorithm(s) used to make predictive texting work. I decided I would post it as a challenge today since there are several monks that enjoy interesting distractions. I don't have any other purpose for it than that. Cheers - L~R	[reply]
Re: Challenge: Predictive Texting by BrowserUk (Patriarch) on Jan 11, 2007 at 14:06 UTC
It's not clear to me yet how the user would indicate that they have entered the full spelling of a word that does not exist in the algorithms lexican? Or, to put it another way, how does the algorithm indicate that it has no more alternatives when the user is indicating that they've entered enough digits to spell the word, but have yet to be offered the correct prediction? Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 11, 2007 at 14:10 UTC
BrowserUk, ...how does the algorithm indicate that it has no more alternatives when the user is indicating that they've entered enough digits to spell the word, but have yet to be offered the correct prediction, and the algorithm has no further alternatives? All words will be present in 2of12.txt (only authorized dictionary) If your algorithm fails to present the word that is in the dictionary provided it fails completely. This is to simplify things as was the removal of punctuation, adaptive learning, determining intended word by position in sentence, etc. I wanted to start with the bare essentials since I was only giving a week time frame and most people do have jobs. Cheers - L~R	[reply]
Re^3: Challenge: Predictive Texting by GrandFather (Saint) on Jan 11, 2007 at 19:44 UTC
What do you mean "removal of ... adaptive learning"? Have I to go back and rewrite my submission! DWIM is Perl's answer to Gödel	[reply]
Re^4: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 11, 2007 at 19:50 UTC
Re: Challenge: Predictive Texting by Anonymous Monk on Jan 10, 2007 at 15:22 UTC
All entries must support the same API to be considered. Please define the API, preferably in perl.	[reply]
Re^2: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 16:10 UTC
Anonymous Monk], Please define the API, preferably in perl. To quote myself: Digits will be passed in one at a time. The expected return value will be the best "suggestion" for the amount of information provided so far. If instead of a digit, a '+' is passed in, it means that the word is finished being spelled and the correct word has been returned. If instead of a digit, a '-' is passed in, it means that the word is finished being spelled but the "suggested" word is incorrect. Digits will resume following a '+'. Admittedly, I should have said "All entries must provide a similar API to be considered." In other words, your solution must provide a mechanism for accepting input and providing output as described above. Cheers - L~R	[reply]
Re^3: Challenge: Predictive Texting by Anonymous Monk on Jan 10, 2007 at 16:25 UTC
What is the mechanism for accepting input/providing output?	[reply]
Re^4: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 10, 2007 at 17:26 UTC
Re: Challenge: Predictive Texting by Anonymous Monk on Jan 10, 2007 at 16:28 UTC
`my %keyb = ( 2 => [qw[a b c]], 3 => [qw[d e f]], 4 => [qw[h i j]], 5 => [qw[k l m]], 6 => [qw[m n o]], 7 => [qw[p q r s]], 8 => [qw[t u v]], 9 => [qw[w x y z]], );` [download]	[reply] [d/l]
Re^2: Challenge: Predictive Texting by diotalevi (Canon) on Jan 10, 2007 at 16:42 UTC
Isn't `'#' => [' ']` missing there? ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊	[reply] [d/l]
Re^3: Challenge: Predictive Texting by Anonymous Monk on Jan 10, 2007 at 20:56 UTC
From the update, Update (2007-01-10 12:23 EST): A zero '0' will be used to indicate a space	[reply]
Re^4: Challenge: Predictive Texting by Limbic~Region (Chancellor) on Jan 11, 2007 at 13:37 UTC