in reply to X-Prize: Natural Language Processing
in thread X-prize software challenge?

I think that this is not a good software X-Prize contender because:

  1. What constitutes "any human language"?
    • LA street slang?
    • Egyptian hieroglyphics?
    • Chaucer's English?
    • Pidgin?
    • Bill & Ted speak?
    • Clockwork Orange "Nadsat"?
    • WW2 Navajo code?
  2. Is this an "arbitrary piece of text"?

    platelayers paratrooper spumoni subversive bala womenfolk zealot wangling gym clout proxemic abravanel entryway assimilates faucets dialup's lamellate apparent propositioning olefin froude.

  3. Neither the goal nor the criteria specify anything about meaning.

    Input: "Mich interessiert das Thema, weil ich fachlich/ beruflich mit Betroffenen zu tun habe."

    Output: "Engaged computing means computational processes that are engaged with the world—think embedded systems."

    Both sentences are (probably) pretty well-formed in their respective languages. The cause of my indecision is that:

    1. I don't speak German, so I can't comment on the first.
    2. My native English skills are far from brilliant. The second seems to make sense, and was probably written by a human being, but whether an English language teacher would find it so is a different matter.

    However, the two sentences (probably) have very little common meaning, as I picked them at random off the net.

The problem with every definition I've seen of "Natural Language Processing" is that it assumes it is possible to encode not only the syntactic and semantic information contained in a piece of meaningful, correct* text in such a way that all of that information can be embodied in some other language, but also all the meta-information that the human brain divines from auxiliary clues: context; prior awareness of the writer's style, attitudes and prejudices; and a whole lot more besides.

*How do we deal with almost-correct input?

Even a single word can have many meanings, which a human being can in many cases divine through context. E.g.

Fire.

In the absence of any meta-clues, there are at least 3 or 4 possible interpretations of that single word. Chances are that English is the only language in which those 3 or 4 meanings use the same word.
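
To make the point concrete, here is that ambiguity as a trivial Perl data structure; the sense labels are my own, and nowhere near exhaustive:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # One surface form, several unrelated senses; nothing in the word
    # itself tells you which one is meant.
    my %senses = (
        fire => [
            'combustion; flames',
            'discharge a weapon ("Fire!")',
            'dismiss someone from a job',
            'a shouted alarm or warning',
        ],
    );

    print "fire => $_\n" for @{ $senses{fire} };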

Then there are phrases like "Oh really". Without context, that can be a genuine enquiry or pure sarcasm. In many cases, even native English speakers are hard pushed to discern the intended meaning, even with the benefit of hearing the spoken inflection and being party to the context.

Indeed, whenever the use of language moves beyond the simplest of purely descriptive use, the meaning heard by the listener (or read by the reader) is as much a function of the listener's or reader's experiences, biases and knowledge as it is of the speaker's or writer's.

How often, even in this place with its fairly constrained focus, do half a dozen readers come away with different interpretations of a writer's words?

If you translate a document from one language to another, word by word, you usually end up with garbage (the toy sketch below makes the point). If you translate phrase by phrase, you need a huge lookup table of all the billions of possible phrases, and you might end up with something that reads more fluently, but there are two problems.

  1. The mappings of phrase to phrase in each language would need to be done by a human being fluent in both languages (or a pair of native speakers of the two languages who could compare and contrast possible meanings until they arrived at a consensus). This would be a huge undertaking for any single pair of languages; but for all human languages?
  2. Even then, it's not too hard to sit and construct a phrase in English, and a translation of it in a second language, that would be correct in some contexts but utterly wrong in others.

    How do you encapsulate the variability between what the writer intended to write and what the reader read? And more so, the differences in meaning perceived between two or more readers reading the same words? Or the same reader, reading the same words in two or more different contexts?

Using the "huge lookup table" method, the magnitude of the problem is not the hardware problem of storing and fast retreival of the translation databases. The problem is of constructing them in the first place.

The 'other' method of achieving the goal, that of translating all of the syntactic, semantic, contextual, environmental and every other "...al" meaning embodied within natural language into some machine-encodable intermediate language, so that once so encoded, the translation to other languages can be done by applying a set of language-specific "construction rules", is even less feasible.
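
In caricature, and only in caricature, that method looks something like the sketch below. Everything in it is invented for illustration; deriving the concept structure from real text is precisely the unsolved part:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # The intermediate-language method in caricature. The concept
    # structure is hand-built; note how gender, case and tense are
    # already quietly hardcoded in the rules -- exactly the sort of
    # detail that multiplies without end.
    my $concept = {
        event    => 'sit',
        agent    => 'cat',
        location => 'mat',
        tense    => 'past',
    };

    my %lex_de = ( cat => 'Katze', mat => 'Matte', sit => 'sass' );

    my %rules = (
        en => sub {
            my $c = shift;
            return "The $c->{agent} sat on the $c->{location}.";
        },
        de => sub {
            my $c = shift;
            return "Die $lex_de{ $c->{agent} } $lex_de{ $c->{event} } auf der $lex_de{ $c->{location} }.";
        },
    );

    print "$_: ", $rules{$_}->($concept), "\n" for sort keys %rules;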

I think the problem with Natural Language Processing is that, as yet, even the cleverest and most fluent speakers (in any language) have not found a way to use natural language to convey exact meanings, even to others fluent in the same language.

Until humans can achieve this with a reasonable degree of accuracy and precision, writing a computer program to do it is a non-starter.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Re^2: X-Prize: Natural Language Processing
by dragonchild (Archbishop) on Oct 17, 2004 at 00:25 UTC
    The goal of an NLP program fluent in any two given languages is to provide the same capabilities that a person fluent in two languages would provide between those two languages. If a person fluent in English and some other language were asked "Translate the sentence Fire.", that person would ask for context. A properly-written NLP program would do the same thing. The same goes for your list of words grabbed at random from /usr/dict.

    The 'other' method of achieving the goal, that of translating all of the syntactic, semantic, contextual, environmental and every other "...al" meaning embodied within natural language into some machine-encodable intermediate language, so that once so encoded, the translation to other languages can be done by applying a set of language-specific "construction rules", is even less feasible.

    Why is that so? I think that the problem is one of algorithm and data structure. Most attempts I've seen (and my cousin wrote her master's on the very topic ... in French) attempt to follow standard sentence deconstruction, the kind you learned in English class. I think that this method fails to understand the purpose of language.

    Language, to my mind, is meant to convey concepts. Very fuzzy, un-boxable concepts. But the only intermediate language we have is, well, language. So, we encode in a very lossy algorithm to words, phrases, sentences, and paragraphs. Then, the listener decodes in a similarly lossy algorithm (which isn't the same algorithm anyone else would use to decode the same text) into their framework of concepts. Usually, the paradigms are close enough, or the communication is generic enough, that transmission of concepts is possible. However, there are many instances, and I'm sure each of us has run into one, where the concepts we were trying to communicate did not get through. And this is, as you noted, not just a problem between speakers of different languages, but also between fluent speakers of the same language.
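
    A caricature of what I mean, with both mappings invented purely for illustration:

        #!/usr/bin/perl
        use strict;
        use warnings;

        # The speaker projects a rich concept onto a few words (lossy);
        # the listener reconstructs a concept using their *own* mapping,
        # which need not be the inverse of the speaker's.
        my %speaker_encode  = ( 'fond-memory-of-childhood-pet' => 'I like cats' );
        my %listener_decode = ( 'I like cats' => 'prefers-cats-to-dogs' );

        my $said  = $speaker_encode{'fond-memory-of-childhood-pet'};
        my $heard = $listener_decode{$said};

        print "meant: fond-memory-of-childhood-pet\n";
        print "heard: $heard\n";    # the round trip loses the original concept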

    I would like to note that such projects of constructing an intermediate language have successfully occurred in the past. The most notable example of this is the Chinese writing system. There are at least 5 major languages that use the exact same writing system. Thus, someone who speaks only Mandarin can communicate just fine with someone who speaks only Cantonese, solely by written communication. There are other examples, but none as wide-reaching. So, it is a feasible idea. And, I think, it's a critical idea. If we can come up with an intermediate language representing the actual concepts being communicated, it would revolutionize philosophy, linguistics, computer science, and a host of other fields. It's not a matter of whether this project is worthwhile; I think it's that we cannot afford not to do it.

    Being right, does not endow the right to be rude; politeness costs nothing.
    Being unknowing, is not the same as being stupid.
    Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
    Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

      that person would ask for context. A properly-written NLP program would do the same thing.

      If I sent you a /msg saying "Why do you deprecate tie and not bless?", you would immediately be able to respond to that question.

      Ask an (existing) NLP to translate that into and back from any other language, or even a host of other languages, and you get:

      1. German: "Warum mißbilligen Sie Riegel und segnen nicht?"

        "Why do you disapprove latch plates and do segnen not?"

      2. French:"Pourquoi est-ce que vous désapprouvez la cravate et ne la bénissezpas?"

        "Why do you disapprove the tie and it bénissezpas?"

      3. Chinese: "?????????????".

        "Why do you belittle the tie and do not bless?"

      4. Japanese: "????????????????"

        "Why, you criticize the tie, don't praise?"

      5. Dutch:"Waarom keurt zegent niet u band af en?"

        "Why inspects don't you bless link finished and?"

      6. Russian: "?????? ?? deprecate ????? ? ?? ???????????????"

        "Why you deprecate connection and do not bless?"

      7. Korean: "?? ? ??? ?????? ???? ????"

        "Tie under criticizing it boils it doesn't bless why it spreads out?"

      That looked damned impressive when I pasted it, and rubbish now I've submitted it:(

      I'm only vaguely fluent in one of those languages, and I would be hard pushed to recognise the question I asked, even though I am completely aware of the context, background and content.

      How can an NLP "ask for context"? Most of the context of that question is completely absent from this post, and from this entire thread; some of it depends upon information only conveyed between us (you and me) through private communications. Without having all of the background information, and/or one of us to interrogate, can you ever imagine an NLP being able to communicate the essence of the question I am asking in another language?

      Even a human being, fluent in English and whichever other language(s) we want it translated into, would be extremely hard pushed to convey the essence of that question unless they also had a pretty intimate knowledge of not just programming in general, but Perl 5 specifically. Indeed, the question would be confusing and essentially meaningless, even in English, to anyone without the requisite background.

      And that's my point. Human speech shorthands so much, on the basis of the speaker's knowledge of the listener's knowledge and experience. Try the mental exercise of estimating just how much extra information would be required to allow another native English speaker, who has no knowledge of computers or Perl, to understand that question. I seriously doubt it could be done in less than 50,000 words.

      Now imagine trying to translate those 50,000 words into Navajo, or Inuit, such that a native speaker of those languages, without computer and Perl 5 experience, could understand it.

      By now, you're probably thinking "But the recipient of such a message would have that experience, otherwise you wouldn't be asking them that question", and you would be right. But it is my contention that if the NLP is going to be able to convey the essence of the words 'tie' and 'bless' in the question, in suitably non-bondage-related and non-religious-related terms in the target language, it would need that same knowledge.

      Of course, then you might say: "If the recipient knows Perl programming, then there is no need, and it would in fact be detrimental, to translate those terms at all". But then the NLP has to have that knowledge in order to know not to translate those two terms. It would also need to 'know' that the recipient had the knowledge to understand the untranslated terms!
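
      The nearest I can imagine to a workable half-measure is to mask known jargon before translation and restore it afterwards, along the lines of the sketch below (the term list and masking scheme are pure invention on my part). But deciding what belongs on that list is exactly the knowledge problem I'm describing:

          #!/usr/bin/perl
          use strict;
          use warnings;

          # Mask known jargon terms before handing the text to a translator,
          # then restore them afterwards. Building the term list is the
          # hard part.
          my @jargon = qw( tie bless );
          my $text   = q{Why do you deprecate tie and not bless?};

          my %mask;
          my $i = 0;
          for my $term (@jargon) {
              my $token = sprintf '__JARGON%d__', $i++;
              $mask{$token} = $term;
              $text =~ s/\b\Q$term\E\b/$token/g;
          }

          # ... $text would go to the translator here, the jargon now opaque ...

          $text =~ s/\Q$_\E/$mask{$_}/g for keys %mask;
          print "$text\n";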

      Apply that same logic to conversation between neurosurgeons, or particle physicists, or sushi chefs, or hair-stylists, or mothers.

      Spoken and written language is rife with supposed knowledge and contextual inference. Just as I would have extreme difficulty trying to explain the background of the question to a Japanese sushi chef, he would have extreme difficulty explaining the process of preparing blowfish to me.

      Not only can I see no way to encapsulate all that disparate knowledge into a computer program, but neither can I see how to program the computer to ask the right questions to allow the translation of such information.

      And, I think, it's a critical idea. If we can come up with an intermediate language representing the actual concepts being communicated, it would revolutionize philosophy, linguistics, computer science, and a host of other fields. It's not a matter of whether this project is worthwhile; I think it's that we cannot afford not to do it.

      I agree with the sentiment of this, but not the approach. Not because I wouldn't like it to succeed, but because I simply do not see the time when this will be feasible. Even with Moore's Law working for us (for how much longer?), I do not see the means by which it would be achievable.

      I also think that the underlying problem will be resolved before we find a technological solution to it, in a rather more prosaic, but ultimately more practical, fashion. I think that over time, the diversity of human language will steadily reduce until the problem "goes away".

      I suspect that a single common language will become universal. I doubt it will be recognisably any one of the currently existing languages; more a bastardisation of several of the more widely spoken ones all run together. I imagine that there will be a fairly high content of English (because it's the lingua franca in so many fields already), French (because the French are stubborn enough to ensure it; besides which, it's too nice a language to let die), Chinese and one of the Indian subcontinental languages (because between them they cover about a third of the world's population), and probably many bits of many others.

      Basically, we (or our children('s children?)) will all become bilingual: our 'native' tongues, and "worldspeak".

      Always assuming that we don't run out of fuel, water or air before then!


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
        Maybe the goal for the NLP X-Prize should be a little more ... constrained. I was thinking about constraints this evening. What about the following change?

        A successful NLP X-Prize program should be able to translate any paragraph that is:

        • self-contained with respect to context
        • restricted to a given wordlist (maybe 1000-2500 of the most common words)

        and translate that paragraph from any one of N languages to any other of those languages. The languages would be set in the final specification, but would include the following:
        • English
        • French
        • Chinese (written)
        • Japanese (Kanji, probably)
        • Hindi
        • German
        • Russian
        • Arabic
        • Navajo (or some other AmerIndian language)
        The program would have to perform the translation in under 1 second per word of input or output, whichever is greater.

        The wordlist would be chosen to be dialect-agnostic, as much as is possible. Most of these words would be the words most people learn in grammar school. I'm talking about words like "in", "is", "have", "run", "cat", "dog", etc.
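
        Here is a sketch of the entry check such a contest might use. The 1-second-per-word figure comes from the spec above; the file name and the tokenisation are my own guesses at the mechanics, and the "or output" half of the rule could only be checked after translating:

            #!/usr/bin/perl
            use strict;
            use warnings;

            # Is the paragraph confined to the agreed wordlist, and what is
            # its time budget at 1 second per word of input?
            my %allowed = do {
                open my $fh, '<', 'wordlist.txt' or die "wordlist.txt: $!";
                map { chomp; lc($_) => 1 } <$fh>;
            };

            my $paragraph = 'The cat and the dog run in the house';
            my @words     = $paragraph =~ /(\w+)/g;
            my @unknown   = grep { !$allowed{ lc $_ } } @words;

            die "Outside the wordlist: @unknown\n" if @unknown;
            printf "OK: %d words, budget %d seconds.\n",
                scalar @words, scalar @words;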

        The X-Prize wouldn't go to the program that can translate physics texts, just like the Ansari prize didn't go to the ship that would actually ferry passengers. It went to the ship that demonstrated the feasibility of technologies. Now that SpaceShipOne has succeeded, new ships will be built to actually make it a commercial venture. I would expect the same to happen for any other X-Prize.

        Being right, does not endow the right to be rude; politeness costs nothing.
        Being unknowing, is not the same as being stupid.
        Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
        Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.