I think that this is not a good software X-prize contender because:

  1. What constitutes "any human language"?
    • LA street slang?
    • Egyptian hieroglyphics?
    • Chaucer's English?
    • Pidgin?
    • Bill & Ted speak?
    • Clockwork Orange "nadsat"?
    • WW2 Navajo code?
  2. Is this an "arbitrary piece of text"?

    platelayers paratrooper spumoni subversive bala womenfolk zealot wangling gym clout proxemic abravanel entryway assimilates faucets dialup's lamellate apparent propositioning olefin froude.

  3. Neither the goal nor the criteria specify anything about meaning.

    Input: "Mich interessiert das Thema, weil ich fachlich/ beruflich mit Betroffenen zu tun habe."

    Output: "Engaged computing means computational processes that are engaged with the world—think embedded systems."

    Both sentences are (probably) pretty well-formed in their respective languages. The cause of my indecision is that:

    1. I don't speak German, so I can't comment on the first.
    2. My native English skills are far from brilliant. The second seems to make sense, and was probably written by a human being, but whether an English language teacher would find it so is a different matter.

    However, the two sentences (probably) have very little common meaning, as I picked them at random off the net.

The problem with every definition I've seen of "Natural Language Processing" is that it assumes it is possible to encode all of the syntactic and semantic information contained in a piece of meaningful, correct* text in such a way that it can be embodied in some other language. It also assumes the same of all the meta-information that the human brain divines from auxiliary clues like context, previous awareness of the writer's style, attitudes and prejudices, and a whole lot more besides.

*How do we deal with almost correct input?

Even a single word can have many meanings, which a human being can in many cases divine through context. E.g.

Fire.

In the absence of any meta-clues, there are at least 3 or 4 possible interpretations of that single word. Chances are that English is the only language in which those 3 or 4 meanings use the same word.
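To make that concrete, here is a minimal sketch of what a translator faces with that one word. The glossary is hypothetical and hand-picked: it lists one plausible French rendering per sense (other renderings exist), and the sense labels and the `translate_fire` helper are my own inventions for illustration.

```python
# Hypothetical, hand-picked glossary: sense of English "fire" -> one French rendering.
# Other renderings exist; this is an illustration, not a real dictionary.
FIRE_SENSES = {
    "flames": "feu",
    "discharge a weapon": "tirer",
    "dismiss an employee": "licencier",
}


def translate_fire(sense=None):
    """Return the rendering for a known sense, or every candidate if the sense is unknown."""
    if sense in FIRE_SENSES:
        return [FIRE_SENSES[sense]]
    # Without context there is nothing to choose between the candidates.
    return sorted(FIRE_SENSES.values())


print(translate_fire("flames"))  # ['feu']
print(translate_fire())          # ['feu', 'licencier', 'tirer']
```

The code is trivial; the hard part is working out which `sense` to pass in, and that is exactly the information the bare word doesn't carry.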

Then there are phrases like: "Oh really". Without context, that can be a genuine enquiry, or pure sarcasm. In many cases, even native English speakers are hard pushed to discern the intended meaning, even with the benefit of hearing the spoken inflection and being party to the context.

Indeed, whenever the use of language moves beyond the simplest, purely descriptive use, the meaning heard by the listener (or read by the reader) is as much a function of the listener's or reader's experiences, biases and knowledge as it is of the speaker's or writer's.

How often, even in this place with its fairly constrained focus, do half a dozen readers come away with different interpretations of a writer's words?

If you translate a document from one language to another, word by word, you usually end up with garbage. If you translate phrase by phrase, you need a huge lookup table of all the billions of possible phrases and you might end up with something that reads more fluently, but there are two problems.

  1. The mappings of phrase to phrase in each language would need to be done by a human being fluent in both languages (or a pair of native speakers of the two languages who could compare and contrast possible meanings until they arrived at a consensus). This would be a huge undertaking for any single pair of languages; but for all human languages?
  2. Even then, it's not too hard to sit and construct a phrase in English and a translation of it in a second language that would be correct in some contexts but utterly wrong in others.

    How do you encapsulate the variability between what the writer intended to write, and what the reader read? And more so, the differences in meaning perceived between two or more readers reading the same words? Or the same reader, reading the same words in two or more different contexts?
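For what it's worth, the word-by-word approach mentioned above can be sketched in a few lines. The four-entry glossary and the `translate_word_by_word` helper are hypothetical, made up just for this illustration, and the input is the opening clause of the German sentence quoted earlier.

```python
# Hypothetical four-entry German -> English glossary, just enough for one clause.
GLOSSARY = {
    "mich": "me",
    "interessiert": "interests",
    "das": "the",
    "thema": "topic",
}


def translate_word_by_word(sentence, glossary):
    """Replace each word independently, keeping the source word order."""
    words = sentence.lower().rstrip(".").split()
    # Unknown words are passed through in angle brackets rather than guessed at.
    return " ".join(glossary.get(word, f"<{word}>") for word in words)


print(translate_word_by_word("Mich interessiert das Thema.", GLOSSARY))
# me interests the topic
```

Every word is "right", yet the result is not English as anyone would write it, and that is with a short, literal clause and no idiom, sarcasm or context to worry about.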

Using the "huge lookup table" method, the magnitude of the problem is not the hardware problem of storing and quickly retrieving the translation databases. The problem is that of constructing them in the first place.
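A rough back-of-envelope sketch of that magnitude follows. The numbers are assumptions chosen purely for illustration (a 10,000-word vocabulary, phrases of up to three words, 100 languages), and the count is an upper bound, since most word sequences are not meaningful phrases. Both function names are my own.

```python
def phrase_table_upper_bound(vocab_size, max_len):
    """Count every word sequence of length 1..max_len over the vocabulary."""
    return sum(vocab_size ** k for k in range(1, max_len + 1))


def language_pairs(n_languages):
    """Count unordered pairs of distinct languages, each needing its own mapping."""
    return n_languages * (n_languages - 1) // 2


# Assumed figures, for illustration only.
print(phrase_table_upper_bound(10_000, 3))  # 1000100010000
print(language_pairs(100))                  # 4950
```

Even if only a tiny fraction of those sequences are real phrases, each surviving entry still needs a fluent human to supply and check the mapping, for every pair of languages.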

The 'other' method of achieving the goal is even less feasible: translating all of the syntactic, semantic, contextual, environmental and every other "...al" meaning that is embodied within natural language into some machine-encodable intermediate language so that, once so encoded, the translation to other languages can be done by applying a set of language-specific "construction rules".

I think the problem with Natural Language Processing is that, as yet, even the cleverest and most fluent speakers (in any language) have not found a way to use natural language to convey exact meanings even to others fluent in the same language.

Until humans can achieve this with a reasonable degree of accuracy and precision, writing a computer program to do it is a non-starter.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

In reply to Re: X-Prize: Natural Language Processing by BrowserUk
in thread X-prize software challenge? by BrowserUk
