Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

All,
Predictive texting is technique to reduce the number of key presses used to enter text with SMS. I thought it might be an interesting challenge to see who could implement the best algorithm. Here are the rules:

One winner for each category below will be chosen:

All entries must support the same API to be considered. Digits will be passed in one at a time. The expected return value will be the best "suggestion" for the amount of information provided so far. If instead of a digit, a '+' is passed in, it means that the word is finished being spelled and the correct word has been returned. If instead of a digit, a '-' is passed in, it means that the word is finished being spelled but the "suggested" word is incorrect. Digits will resume following a '+'.

This is of course a silly competition with no real prizes for the winner. Feel free to break any or all rules and you are still likely to get accolades from your fellow competitors. Also, submissions long after the dead line are still encouraged. The reason for picking an arbitrary date is that at some point the secret text will need to be revealed allowing frequency analysis to optimize solutions for that text only.

Any clarifications will be identified below and time stamped.
Update (2007-01-10 11:13 EST): The 2of12.txt dictionary file can be found inside this zip.
Update (2007-01-10 11:25 EST): The keypad will be a standard ITU-T E.161
Update (2007-01-10 12:23 EST): A zero '0' will be used to indicate a space
Update (2007-01-11 08:35 EST): It is obvious that a space is not required to indicate separation of words (+ works fine)

Cheers - L~R

Replies are listed 'Best First'.
Re: Challenge: Predictive Texting
by ikegami (Patriarch) on Jan 10, 2007 at 18:01 UTC

    All entries must support the same API to be considered.

    So what's the API? Specifically:

    Digits will be passed in one at a time.

    To what are the digits passed?
    To a program?
    To a process?
    To a function?

    How are the digits passed?
    By argument?
    As a character on STDIN?
    As a line on STDIN?

    The expected return value will be...

    How is the value returned?
    Function return?
    STDOUT with newline?

    Update: Oops, I see that you've already partially answered this, and the rest can be deduced from that answer. (function/method, function argument, function return)

      ikegami,
      It never ceases to amaze me how much I assume. I regularly post challenges and interesting problems and inevitably there is confusion despite my best efforts at being clear and succinct. In any case, I hope things make sense now.

      Cheers - L~R

Re: Challenge: Predictive Texting
by Not_a_Number (Prior) on Jan 10, 2007 at 19:26 UTC

    Right, I think I understand the rules now. :-)

    A subsidiary question: are we allowed to sort our datastructure by frequency of paragrams (or 'textonyms' as I learn that they are also called from the wikipedia page you link to)?

    If so, does anybody know of a freely available list of word frequencies in US English*? (A good UK English resource is this site, which uses the British National Corpus).

    Come to think of it, the answer to this probably depends on yet another subsidiary question that I have: will the mystery text** consist of (a) more or less 'normal' English prose (albeit with punctation and capitalisation removed) or (b) a more or less random string of words (in which case frequency considerations will be otiose)?

    Looking through 2of12.txt, I see that it is extremely poor in inflected forms (plurals, past tenses...) - even 'lips', which you use in several a couple of your examples above, is not included - which means that it would be pretty difficult to construct a coherent text of any length consisting of words only to be found in the list.

    * Note that 2of12.txt contains few or no UK English variant spellings (no 'colour', 'criticise', 'manoeuvre'...).

    ** BTW, how should we parse 'between 3 and 5 thousand': 'between 3 and 5000', or 'between 3000 and 5000'? </nitpick>

    Update: PS, I forgot to add this. Thanks once again for an interesting, thought-provoking challenge. Limbic~Region++!

      Not_a_Number,
      ...are we allowed to sort our datastructure by frequency of paragrams..

      Yes. In fact, the reason the mystery text remains secret is so this technique is not applied to just that text skewing the results.

      If so, does anybody know of a freely available list of word frequencies in US English?

      I am fairly certain I came across one this morning when researching but can't be sure that it was US English.

      will the mystery text** consist of (a) more or less 'normal' English prose (albeit with punctation and capitalisation removed) or (b) a more or less random string of words (in which case frequency considerations will be otiose)?

      More or less US English prose.

      ... - which means that it would be pretty difficult to construct a coherent text of any length consisting of words only to be found in the list.

      You are quite correct. The 2of12inf.txt does a much better job in this area. On the other hand, if an entire book can be written without using the letter e in two different languages, I am sure that it will not be too difficult to provide mystery text between 3000 and 5000 words that meet the constraints.

      Thanks once again for an interesting, thought-provoking challenge.

      You're welcome.

      Cheers - L~R

        On the other hand, if an entire book can be written without using the letter e in two different languages, (...)

        That would be the famous book by Georges Perec: A_Void (originally "La disparition").

Re: Challenge: Predictive Texting
by perrin (Chancellor) on Jan 10, 2007 at 15:16 UTC
    Isn't it pretty much the same problem that all of the JavaScript autocomplete backend scripts solve? There are a few of them on CPAN.
      perrin,
      Isn't it pretty much the same problem that all of the JavaScript autocomplete backend scripts solve?

      Not exactly, auto-completion is a technique to predict what you are going to type in the future based off what you have already typed. I did a bit of research this morning before posting and there are a number of papers out there regarding predictive texting - so yes, it is a solved problem. That shouldn't deter you from the fun of the challenge.

      Cheers - L~R

Re: Challenge: Predictive Texting
by BrowserUk (Patriarch) on Jan 10, 2007 at 18:22 UTC

    The problem text to solve is secret

    It will contain between 3 and 5 thousand words

    No punctuation except a "space"

    3000 to 5000 words with no indication of end/start of sentence is one hell of a sentence. Predicting the next word could be much more accurate if you know it is the first word of a sentence or the second etc. ?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      I agree and think that would be a great improvement to current predictive texting techniques - at least the ones I have seen. That is a layer of complexity I do not want to add into the challenge for the first round but perhaps will follow it up.

      Cheers - L~R

Re: Challenge: Predictive Texting
by GrandFather (Saint) on Jan 10, 2007 at 23:34 UTC

    A quick (in the sense of not much time spent on it) first cut version before I head off on holiday. :)

    Using an early version of the OP's text as the message text I get the following result:

    844 strokes to send a 901 character message containing 170 words.

    Note that the 2of12.txt file is expected to reside in the current directory.

    Updated to count '+' as strokes and part of message text - they effectively turn into spaces

    Updated to take advantage of "knowing" the word frequency in the test text up front. Strokes reduce to 540! This version prints the strokes used for each word.

    Set LEARN true to have the code learn the word frequency as the text is supplied. In this case the stroke count increases to 810 (there's a slight algorithm improvement on the first version too;) ).


    DWIM is Perl's answer to Gödel
Re: Challenge: Predictive Texting
by demerphq (Chancellor) on Jan 11, 2007 at 09:33 UTC

    I hate predictive texting and T9. I would much prefer that the phone companies release a phone with the letters properly huffman coded onto the buttons. The fact that I have to press a button twice for some of the most common letters in common English usage annoys me no end. WTF are 'P' and 'W' doing as a single press letters?

    Even something where I got to select morphemes based on huffman coding would be much nicer. 'th' 'er', 'ere', 'ing' would go a long way to make things easier, and would be much more consistant and use less phone resources than this crappy AI related stuff that will never work properly. People can learn things like this pretty easy (look how fast kids get at texting given the absolute crap designs out there.)

    ---
    $world=~s/war/peace/g

      demerphq,
      I am not a big fan of predictive texting either. Actually, I don't even own a cell phone. I didn't let that deter me from an interesting challenge though.

      It looks like one follow-on challenge then will be to propose a new keypad layout to see how well your alternative(s) match up.

      Cheers - L~R

Re: Challenge: Predictive Texting
by BrowserUk (Patriarch) on Jan 10, 2007 at 16:39 UTC
    If instead of a digit, a '-' is passed in, it means that the word is finished being spelled but the "suggested" word is incorrect. Digits will resume following a '+'.

    Does that mean that after a '-' is passed in, you'll pass a '+' before resuming passing digits?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      The AM is correct. If for instance:
      <previous input> Input: 7 Output: kiss Input: - Output: lips Input: + Output: lips <repeated> Input: <digit>

      Cheers - L~R

        Sorry, but I'm still not clear what 'the word is finished being spelled' means.

        Could it be paraphrased, for example, by 'the suggested word is the same length as the intended word?

      Each time you pass '-', the program returns another guess, until '-','+', when more letters/digits are added.
Re: Challenge: Predictive Texting
by GrandFather (Saint) on Jan 10, 2007 at 19:57 UTC

    I notice that some of the words in the 2of12inf.txt file have trailing % characters. What are they for?


    DWIM is Perl's answer to Gödel
      GrandFather,
      From the readme - Plurals of "uncountable" nouns were included, annotated with the "%" suffix character. See below for an extended discussion of the inclusion of these words.

      Keep in mind that 2of12.txt ne 2of12inf.txt though I am thinking for add-ons to the challenge, UK english as well as inflections will add to the complexity though the techniques should remain relatively the same.

      Cheers - L~R

Re: Challenge: Predictive Texting
by Anonymous Monk on Jan 10, 2007 at 15:31 UTC
    Which file contains 2of12.txt?
Re: Challenge: Predictive Texting
by halley (Prior) on Jan 10, 2007 at 19:06 UTC
    Isn't T9 a commercially available form of this, on a word-by-word basis? Are you trying to find something that is faster than T9, or just an open implementation of a T9-alike?

    --
    [ e d @ h a l l e y . c c ]

      halley,
      My wife kept me awake last night texting her friends and family. Since I couldn't sleep, I began wondering about the algorithm(s) used to make predictive texting work. I decided I would post it as a challenge today since there are several monks that enjoy interesting distractions. I don't have any other purpose for it than that.

      Cheers - L~R

Re: Challenge: Predictive Texting
by BrowserUk (Patriarch) on Jan 11, 2007 at 14:06 UTC

    It's not clear to me yet how the user would indicate that they have entered the full spelling of a word that does not exist in the algorithms lexican?

    Or, to put it another way, how does the algorithm indicate that it has no more alternatives when the user is indicating that they've entered enough digits to spell the word, but have yet to be offered the correct prediction?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk,
      ...how does the algorithm indicate that it has no more alternatives when the user is indicating that they've entered enough digits to spell the word, but have yet to be offered the correct prediction, and the algorithm has no further alternatives?

      • All words will be present in 2of12.txt (only authorized dictionary)

      If your algorithm fails to present the word that is in the dictionary provided it fails completely. This is to simplify things as was the removal of punctuation, adaptive learning, determining intended word by position in sentence, etc. I wanted to start with the bare essentials since I was only giving a week time frame and most people do have jobs.

      Cheers - L~R

        What do you mean "removal of ... adaptive learning"? Have I to go back and rewrite my submission!


        DWIM is Perl's answer to Gödel
Re: Challenge: Predictive Texting
by Anonymous Monk on Jan 10, 2007 at 15:22 UTC
    All entries must support the same API to be considered. Please define the API, preferably in perl.
      Anonymous Monk],
      Please define the API, preferably in perl.

      To quote myself:
      Digits will be passed in one at a time. The expected return value will be the best "suggestion" for the amount of information provided so far. If instead of a digit, a '+' is passed in, it means that the word is finished being spelled and the correct word has been returned. If instead of a digit, a '-' is passed in, it means that the word is finished being spelled but the "suggested" word is incorrect. Digits will resume following a '+'.

      Admittedly, I should have said "All entries must provide a similar API to be considered." In other words, your solution must provide a mechanism for accepting input and providing output as described above.

      Cheers - L~R

        What is the mechanism for accepting input/providing output?
Re: Challenge: Predictive Texting
by Anonymous Monk on Jan 10, 2007 at 16:28 UTC
    my %keyb = ( 2 => [qw[a b c]], 3 => [qw[d e f]], 4 => [qw[h i j]], 5 => [qw[k l m]], 6 => [qw[m n o]], 7 => [qw[p q r s]], 8 => [qw[t u v]], 9 => [qw[w x y z]], );

      Isn't '#' => [' '] missing there?

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

        From the update, Update (2007-01-10 12:23 EST): A zero '0' will be used to indicate a space