Re: Challenge: Predictive Texting
by Not_a_Number (Prior) on Jan 10, 2007 at 19:26 UTC
|
Right, I think I understand the rules now. :-)
A subsidiary question: are we allowed to sort our datastructure by frequency of paragrams (or 'textonyms' as I learn that they are also called from the
wikipedia page you link to)?
If so, does anybody know of a freely available list of word frequencies in US English*? (A good UK English resource is this site, which uses the British National Corpus).
Come to think of it, the answer to this probably depends on yet another subsidiary question that I have: will the mystery text** consist of (a) more or less 'normal' English prose (albeit with punctation and capitalisation removed) or (b) a more or less random string of words (in which case frequency considerations will be otiose)?
Looking through 2of12.txt, I see that it is extremely poor in inflected forms (plurals, past tenses...) - even 'lips', which you use in several a couple of your examples above, is not included - which means that it would be pretty difficult to construct a coherent text of any length consisting of words only to be found in the list.
* Note that 2of12.txt contains few or no UK English variant spellings (no 'colour', 'criticise', 'manoeuvre'...).
** BTW, how should we parse 'between 3 and 5 thousand': 'between 3 and 5000', or 'between 3000 and 5000'? </nitpick>
Update: PS, I forgot to add this. Thanks once again for an interesting, thought-provoking challenge. Limbic~Region++!
| [reply] |
|
|
Not_a_Number,
...are we allowed to sort our datastructure by frequency of paragrams..
Yes. In fact, the reason the mystery text remains secret is so this technique is not applied to just that text skewing the results.
If so, does anybody know of a freely available list of word frequencies in US English?
I am fairly certain I came across one this morning when researching but can't be sure that it was US English.
will the mystery text** consist of (a) more or less 'normal' English prose (albeit with punctation and capitalisation removed) or (b) a more or less random string of words (in which case frequency considerations will be otiose)?
More or less US English prose.
... - which means that it would be pretty difficult to construct a coherent text of any length consisting of words only to be found in the list.
You are quite correct. The 2of12inf.txt does a much better job in this area. On the other hand, if an entire book can be written without using the letter e in two different languages, I am sure that it will not be too difficult to provide mystery text between 3000 and 5000 words that meet the constraints.
Thanks once again for an interesting, thought-provoking challenge.
You're welcome.
| [reply] |
|
|
| [reply] |
|
|
Re: Challenge: Predictive Texting
by ikegami (Patriarch) on Jan 10, 2007 at 18:01 UTC
|
All entries must support the same API to be considered.
So what's the API? Specifically:
Digits will be passed in one at a time.
To what are the digits passed?
To a program?
To a process?
To a function?
How are the digits passed?
By argument?
As a character on STDIN?
As a line on STDIN?
The expected return value will be...
How is the value returned?
Function return?
STDOUT with newline?
Update: Oops, I see that you've already partially answered this, and the rest can be deduced from that answer. (function/method, function argument, function return)
| [reply] |
|
|
ikegami,
It never ceases to amaze me how much I assume. I regularly post challenges and interesting problems and inevitably there is confusion despite my best efforts at being clear and succinct. In any case, I hope things make sense now.
| [reply] |
Re: Challenge: Predictive Texting
by BrowserUk (Patriarch) on Jan 10, 2007 at 18:22 UTC
|
The problem text to solve is secret
It will contain between 3 and 5 thousand words
No punctuation except a "space"
3000 to 5000 words with no indication of end/start of sentence is one hell of a sentence. Predicting the next word could be much more accurate if you know it is the first word of a sentence or the second etc. ?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
|
BrowserUk,
I agree and think that would be a great improvement to current predictive texting techniques - at least the ones I have seen. That is a layer of complexity I do not want to add into the challenge for the first round but perhaps will follow it up.
| [reply] |
Re: Challenge: Predictive Texting
by GrandFather (Saint) on Jan 10, 2007 at 23:34 UTC
|
A quick (in the sense of not much time spent on it) first cut version before I head off on holiday. :)
use strict;
use warnings;
use constant LEARN => 1; # Set true to learn by doing
open WORDS, '<', '2of12.txt';
my @words = <WORDS>;
close WORDS;
chomp @words;
@words = map {[$_, asSMS ($_)]} @words;
my $lookup = bless {lookup => {}, nextWordNIndex => 0};
$lookup->addWord ($_) for @words;
$lookup->training ();
$lookup->{scan} = $lookup->{lookup};
my %dict = map {($_->[0] => $_->[1])} @words;
my $strokes = 0;
my $characters = 0;
my $txtWords = 0;
while (<DATA>) {
chomp;
my @dataWords = split;
for my $word (@dataWords) {
$word =~ tr/a-z//cd;
next if ! exists $dict{$word};
++$txtWords;
my @digits = @{asSMS ($word)};
my @pressed;
$characters += @digits;
while (1) {
my $digit = (shift @digits) || '-';
my $result = $lookup->addDigit ($digit);
++$strokes;
push @pressed, $digit;
if ($result eq $word) {
$lookup->addDigit ('+');
++$characters;
++$strokes;
push @pressed, '+';
last;
}
}
print "@pressed ($word)\n";
}
}
print "$strokes strokes to send a $characters character message contai
+ning $txtWords words.\n";
sub addWord {
my ($self, $word) = @_;
my @digits = @{$word->[1]};
my $text = $word->[0];
my $scan = $self->{lookup};
for my $digit (@digits) {
if (defined $scan->{$digit}) {
push @{$scan->{$digit}{words}}, [$text, 0];
} else {
$scan->{$digit} = {words => [[$text, 0]]};
}
$scan = $scan->{$digit};
}
}
sub addDigit {
my ($self, $digit) = @_;
if ($digit eq '+') {
$self->train ($self->{lastGuess}[0], 1) if LEARN;
$self->doneWord ();
return $self->{lastGuess}[0];
} elsif ($digit eq '-') {
if (! exists $self->{scan}{wordsN}) {
@{$self->{scan}{wordsN}} =
grep {$self->{wordlen} == length $_->[0]} @{$self->{sc
+an}{words}};
}
die "Unknown word @{$self->{digits}}"
if $self->{nextWordNIndex} == @{$self->{scan}{wordsN}};
# Skip words that have been tried already
my $wordsN = $self->{scan}{wordsN};
++$self->{nextWordNIndex}
while exists $self->{tries}{$wordsN->[$self->{nextWordNInd
+ex}][0]};
$self->{lastGuess} = $wordsN->[$self->{nextWordNIndex}]
} else {
die "Unknown word @{$self->{digits}} $digit"
if ! exists $self->{scan}{$digit};
push @{$self->{digits}}, $digit;
$self->{scan} = $self->{scan}{$digit};
++$self->{wordlen};
my $index = 0;
my $words = $self->{scan}{words};
# Skip words that have been tried already
++$index while exists $self->{tries}{$words->[$index][0]};
$self->{lastGuess} = $words->[$index]
}
$self->{tries}{$self->{lastGuess}[0]} = 1;
return $self->{lastGuess}[0];
}
sub train {
my ($self, $word, $weight) = @_;
my @digits = @{asSMS ($word)};
my $wordlen = length $word;
$weight ||= 1;
$weight += $wordlen / 10.0;
# Sort the word lists by frequency
my $scan = $self->{lookup};
for my $digit (@digits) {
$scan = $scan->{$digit};
my @match = grep {$_->[0] eq $word} @{$scan->{words}};
last if ! @match;
$_->[1] += $weight for @match;
@{$scan->{words}} = sort {$b->[1] <=> $a->[1]} @{$scan->{words
+}};
}
# Sort the wordN lists by frequency
$scan = $self->{lookup};
$scan = $scan->{$_} for @digits;
if (! exists $scan->{wordsN}) {
@{$scan->{wordsN}} =
grep {$wordlen == length $_->[0]} @{$scan->{words}};
}
my @match = grep {$_->[0] eq $word} @{$scan->{wordsN}};
$_->[1] += $weight for @match;
@{$scan->{wordsN}} = sort {$b->[1] <=> $a->[1]} @{$scan->{wordsN}}
+;
}
sub doneWord {
my $self = shift;
$self->{scan} = $self->{lookup};
$self->{digits} = ();
$self->{wordlen} = 0;
$self->{nextWordNIndex} = 0;
$self->{tries} = ();
}
sub asSMS {
my $word = shift;
my @SMSWord;
$word =~ tr /a-z//dc;
for (split //, $word) {
if (/[abc]/) {
push @SMSWord,'2';
} elsif (/[def]/) {
push @SMSWord,'3';
} elsif (/[ghi]/) {
push @SMSWord,'4';
} elsif (/[jkl]/) {
push @SMSWord,'5';
} elsif (/[mno]/) {
push @SMSWord,'6';
} elsif (/[pqrs]/) {
push @SMSWord,'7';
} elsif (/[tuv]/) {
push @SMSWord,'8';
} elsif (/[wxyz]/) {
push @SMSWord,'9';
}
}
return \@SMSWord;
}
sub training {
my $self = shift;
my %freqs = (
i => 1,
an => 1,
no => 1,
so => 1,
by => 1,
at => 1,
has => 1,
far => 1,
see => 1,
key => 1,
who => 1,
but => 1,
are => 1,
each => 1,
used => 1,
with => 1,
here => 1,
only => 1,
same => 1,
been => 1,
pure => 1,
below => 1,
might => 1,
could => 1,
value => 1,
enter => 1,
solve => 1,
rules => 1,
minus => 1,
space => 1,
pearl => 1,
number => 1,
winner => 1,
fewest => 1,
reduce => 1,
secret => 1,
amount => 1,
chosen => 1,
except => 1,
resume => 1,
return => 1,
texting => 1,
presses => 1,
thought => 1,
correct => 1,
contain => 1,
support => 1,
fastest => 1,
problem => 1,
between => 1,
present => 1,
category => 1,
thousand => 1,
returned => 1,
provided => 1,
expected => 1,
suggested => 1,
incorrect => 1,
lowercase => 1,
technique => 1,
challenge => 1,
interface => 1,
following => 1,
algorithm => 1,
submitted => 1,
solutions => 1,
implement => 1,
suggestion => 1,
dictionary => 1,
authorized => 1,
completion => 1,
considered => 1,
keystrokes => 1,
predictive => 1,
programming => 1,
interesting => 1,
punctuation => 1,
information => 1,
application => 1,
if => 2,
one => 2,
and => 2,
for => 2,
that => 2,
time => 2,
text => 2,
plus => 2,
best => 2,
digit => 2,
being => 2,
means => 2,
digits => 2,
entries => 2,
instead => 2,
spelled => 2,
finished => 2,
must => 3,
words => 3,
passed => 3,
of => 4,
it => 4,
in => 4,
word => 4,
to => 5,
all => 5,
is => 7,
will => 7,
a => 8,
be => 9,
the => 12,
);
if (! LEARN) {
# Cheat by knowing the word frequencey up front
$self->train ($_, $freqs{$_}) for keys %freqs;
}
}
__DATA__
all
predictive texting is a technique to reduce the number of key presses
+used to
enter text with i thought it might be an interesting challenge to see
+who
could implement the best algorithm here are the rules
the problem text to solve is secret
it will contain between and thousand words
all words will be present in only authorized dictionary
all words will be lowercase
no punctuation except a space
solutions must be pure pearl
all entries must be submitted by
one winner for each category below will be chosen
fewest keystrokes
fastest completion time
all entries must support the same application programming interface to
+ be
considered digits will be passed in one at a time the expected return
+value will
be the best suggestion for the amount of information provided so far i
+f instead
of a digit a plus is passed in it means that the word is finished bein
+g spelled
and the correct word has been returned if instead of a digit a minus i
+s passed
in it means that the word is finished being spelled but the suggested
+word is
incorrect digits will resume following a plus
Using an early version of the OP's text as the message text I get the following result:
844 strokes to send a 901 character message containing 170 words.
Note that the 2of12.txt file is expected to reside in the current directory.
Updated to count '+' as strokes and part of message text - they effectively turn into spaces
Updated to take advantage of "knowing" the word frequency in the test text up front. Strokes reduce to 540! This version prints the strokes used for each word.
Set LEARN true to have the code learn the word frequency as the text is supplied. In this case the stroke count increases to 810 (there's a slight algorithm improvement on the first version too;) ).
DWIM is Perl's answer to Gödel
| [reply] [d/l] [select] |
Re: Challenge: Predictive Texting
by demerphq (Chancellor) on Jan 11, 2007 at 09:33 UTC
|
I hate predictive texting and T9. I would much prefer that the phone companies release a phone with the letters properly huffman coded onto the buttons. The fact that I have to press a button twice for some of the most common letters in common English usage annoys me no end. WTF are 'P' and 'W' doing as a single press letters?
Even something where I got to select morphemes based on huffman coding would be much nicer. 'th' 'er', 'ere', 'ing' would go a long way to make things easier, and would be much more consistant and use less phone resources than this crappy AI related stuff that will never work properly. People can learn things like this pretty easy (look how fast kids get at texting given the absolute crap designs out there.)
---
$world=~s/war/peace/g
| [reply] |
|
|
| [reply] |
Re: Challenge: Predictive Texting
by perrin (Chancellor) on Jan 10, 2007 at 15:16 UTC
|
Isn't it pretty much the same problem that all of the JavaScript autocomplete backend scripts solve? There are a few of them on CPAN. | [reply] |
|
|
perrin,
Isn't it pretty much the same problem that all of the JavaScript autocomplete backend scripts solve?
Not exactly, auto-completion is a technique to predict what you are going to type in the future based off what you have already typed. I did a bit of research this morning before posting and there are a number of papers out there regarding predictive texting - so yes, it is a solved problem. That shouldn't deter you from the fun of the challenge.
| [reply] |
Re: Challenge: Predictive Texting
by BrowserUk (Patriarch) on Jan 10, 2007 at 16:39 UTC
|
If instead of a digit, a '-' is passed in, it means that the word is finished being spelled but the "suggested" word is incorrect. Digits will resume following a '+'.
Does that mean that after a '-' is passed in, you'll pass a '+' before resuming passing digits?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
|
BrowserUk,
The AM is correct. If for instance:
<previous input>
Input: 7
Output: kiss
Input: -
Output: lips
Input: +
Output: lips <repeated>
Input: <digit>
| [reply] [d/l] |
|
|
Sorry, but I'm still not clear what 'the word is finished being spelled' means.
Could it be paraphrased, for example, by 'the suggested word is the same length as the intended word?
| [reply] |
|
|
|
|
|
|
|
Each time you pass '-', the program returns another guess, until '-','+', when more letters/digits are added.
| [reply] |
Re: Challenge: Predictive Texting
by GrandFather (Saint) on Jan 10, 2007 at 19:57 UTC
|
I notice that some of the words in the 2of12inf.txt file have trailing % characters. What are they for?
DWIM is Perl's answer to Gödel
| [reply] |
|
|
| [reply] |
Re: Challenge: Predictive Texting
by Anonymous Monk on Jan 10, 2007 at 15:31 UTC
|
Which file contains 2of12.txt? | [reply] |
|
|
| [reply] |
Re: Challenge: Predictive Texting
by halley (Prior) on Jan 10, 2007 at 19:06 UTC
|
Isn't T9 a commercially available form of this, on a word-by-word basis? Are you trying to find something that is faster than T9, or just an open implementation of a T9-alike?
-- [ e d @ h a l l e y . c c ]
| [reply] |
|
|
halley,
My wife kept me awake last night texting her friends and family. Since I couldn't sleep, I began wondering about the algorithm(s) used to make predictive texting work. I decided I would post it as a challenge today since there are several monks that enjoy interesting distractions. I don't have any other purpose for it than that.
| [reply] |
Re: Challenge: Predictive Texting
by BrowserUk (Patriarch) on Jan 11, 2007 at 14:06 UTC
|
It's not clear to me yet how the user would indicate that they have entered the full spelling of a word that does not exist in the algorithms lexican?
Or, to put it another way, how does the algorithm indicate that it has no more alternatives when the user is indicating that they've entered enough digits to spell the word, but have yet to be offered the correct prediction?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
|
BrowserUk,
...how does the algorithm indicate that it has no more alternatives when the user is indicating that they've entered enough digits to spell the word, but have yet to be offered the correct prediction, and the algorithm has no further alternatives?
- All words will be present in 2of12.txt (only authorized dictionary)
If your algorithm fails to present the word that is in the dictionary provided it fails completely. This is to simplify things as was the removal of punctuation, adaptive learning, determining intended word by position in sentence, etc. I wanted to start with the bare essentials since I was only giving a week time frame and most people do have jobs.
| [reply] |
|
|
| [reply] |
|
|
Re: Challenge: Predictive Texting
by Anonymous Monk on Jan 10, 2007 at 15:22 UTC
|
All entries must support the same API to be considered.
Please define the API, preferably in perl. | [reply] |
|
|
Anonymous Monk],
Please define the API, preferably in perl.
To quote myself:
Digits will be passed in one at a time. The expected return value will be the best "suggestion" for the amount of information provided so far. If instead of a digit, a '+' is passed in, it means that the word is finished being spelled and the correct word has been returned. If instead of a digit, a '-' is passed in, it means that the word is finished being spelled but the "suggested" word is incorrect. Digits will resume following a '+'.
Admittedly, I should have said "All entries must provide a similar API to be considered." In other words, your solution must provide a mechanism for accepting input and providing output as described above.
| [reply] |
|
|
What is the mechanism for accepting input/providing output?
| [reply] |
|
|
Re: Challenge: Predictive Texting
by Anonymous Monk on Jan 10, 2007 at 16:28 UTC
|
my %keyb = (
2 => [qw[a b c]],
3 => [qw[d e f]],
4 => [qw[h i j]],
5 => [qw[k l m]],
6 => [qw[m n o]],
7 => [qw[p q r s]],
8 => [qw[t u v]],
9 => [qw[w x y z]],
);
| [reply] [d/l] |
|
|
| [reply] [d/l] |
|
|
From the update, Update (2007-01-10 12:23 EST): A zero '0' will be used to indicate a space
| [reply] |
|
|