in reply to Challenge: Predictive Texting

Right, I think I understand the rules now. :-)

A subsidiary question: are we allowed to sort our data structure by frequency of paragrams (or 'textonyms', as I learn from the Wikipedia page you link to that they are also called)?

If so, does anybody know of a freely available list of word frequencies in US English*? (A good UK English resource is this site, which uses the British National Corpus).

Come to think of it, the answer to this probably depends on yet another subsidiary question that I have: will the mystery text** consist of (a) more or less 'normal' English prose (albeit with punctuation and capitalisation removed) or (b) a more or less random string of words (in which case frequency considerations will be otiose)?
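To make the frequency idea concrete, here is a minimal sketch of what I have in mind, assuming a standard phone keypad and words made up of the letters a-z only. The names (to_keys, build_index) and the counts in sample_freqs are made up for illustration; they are not taken from the challenge or from any real corpus.

```python
from collections import defaultdict

# Standard phone keypad layout (letters only).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}


def to_keys(word):
    """Return the digit sequence that types `word` on a phone keypad."""
    # Assumes the word contains only the letters a-z (any case).
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())


def build_index(freqs):
    """Group words by digit sequence, most frequent textonym first."""
    index = defaultdict(list)
    for word in freqs:
        index[to_keys(word)].append(word)
    for words in index.values():
        # Descending frequency, then alphabetical order to break ties.
        words.sort(key=lambda w: (-freqs[w], w))
    return dict(index)


# Hypothetical counts, for illustration only.
sample_freqs = {"good": 50, "home": 40, "gone": 30, "hood": 5, "lips": 3}
index = build_index(sample_freqs)
print(index["4663"])  # ['good', 'home', 'gone', 'hood']
```

With the list under each key sorted like this, the first candidate offered for a key sequence is always its most frequent textonym, which only helps if the mystery text is more or less normal prose.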

Looking through 2of12.txt, I see that it is extremely poor in inflected forms (plurals, past tenses...) - even 'lips', which you use in a couple of your examples above, is not included - which means that it would be pretty difficult to construct a coherent text of any length consisting of words only to be found in the list.
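A quick way to check a candidate text against a word list might look something like the sketch below. It assumes the text has already had punctuation and capitalisation removed, and that the word list has been loaded as a plain list of strings. The function name missing_words and the tiny sample_list are my own inventions; sample_list is a stand-in, not the actual contents of 2of12.txt.

```python
def missing_words(text, wordlist):
    """Return the distinct words of `text` that are absent from `wordlist`,
    in order of first appearance."""
    known = {w.lower() for w in wordlist}
    seen = set()
    missing = []
    for word in text.lower().split():
        if word not in known and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing


# Tiny stand-in for a list that holds base forms only.
sample_list = ["lip", "kiss", "the", "she"]
print(missing_words("she kissed the lips", sample_list))  # ['kissed', 'lips']
```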

* Note that 2of12.txt contains few or no UK English variant spellings (no 'colour', 'criticise', 'manoeuvre'...).

** BTW, how should we parse 'between 3 and 5 thousand': 'between 3 and 5000', or 'between 3000 and 5000'? </nitpick>

Update: PS, I forgot to add this. Thanks once again for an interesting, thought-provoking challenge. Limbic~Region++!

Re^2: Challenge: Predictive Texting
by Limbic~Region (Chancellor) on Jan 10, 2007 at 19:55 UTC
    Not_a_Number,
    ...are we allowed to sort our data structure by frequency of paragrams...

    Yes. In fact, the reason the mystery text remains secret is so that this technique cannot be applied to just that text, which would skew the results.

    If so, does anybody know of a freely available list of word frequencies in US English?

    I am fairly certain I came across one this morning when researching but can't be sure that it was US English.

    will the mystery text** consist of (a) more or less 'normal' English prose (albeit with punctuation and capitalisation removed) or (b) a more or less random string of words (in which case frequency considerations will be otiose)?

    More or less US English prose.

    ... - which means that it would be pretty difficult to construct a coherent text of any length consisting of words only to be found in the list.

    You are quite correct. 2of12inf.txt does a much better job in this area. On the other hand, if an entire book can be written without using the letter e in two different languages, I am sure that it will not be too difficult to provide a mystery text of between 3000 and 5000 words that meets the constraints.

    Thanks once again for an interesting, thought-provoking challenge.

    You're welcome.

    Cheers - L~R

      On the other hand, if an entire book can be written without using the letter e in two different languages, (...)

      That would be the famous book by Georges Perec: "A Void" (originally "La Disparition").