X-Prize: Natural Language Processing

by dragonchild (Archbishop)
on Oct 15, 2004 at 15:49 UTC


in reply to Re: X-prize Suggestions here please!
in thread X-prize software challenge?

Goal
To create a program (with any needed hardware) that can translate any arbitrary piece of text from any human language to any other human language.

Criteria
The criteria here are going to be a little vague, but hopefully we can expand on them.

  • Translations must take no longer than 1 second per word (a rough timing sketch follows the list).
  • An arbitrarily chosen native speaker of the target language must not be able to discern that the translation was computer-generated.
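
As a rough illustration of the first criterion, here is a minimal Perl timing harness. The translate() routine is a hypothetical stub standing in for a real entry; nothing here is part of any actual contest machinery.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Time::HiRes qw(time);

    # Hypothetical stub; a real entry would do the actual translation.
    sub translate { my ($text, $from, $to) = @_; return $text }

    my $text  = "The quick brown fox jumps over the lazy dog.";
    my @words = split ' ', $text;

    my $start = time();
    my $out   = translate($text, 'en', 'de');
    my $took  = time() - $start;

    my $budget = @words * 1.0;   # criterion: at most 1 second per word
    printf "%.2fs used of a %.2fs budget: %s\n",
        $took, $budget, $took <= $budget ? 'PASS' : 'FAIL';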

Being right, does not endow the right to be rude; politeness costs nothing.
Being unknowing, is not the same as being stupid.
Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

Replies are listed 'Best First'.
Re: X-Prize: Natural Language Processing
by BrowserUk (Patriarch) on Oct 16, 2004 at 16:33 UTC

    I think that this is not a good software X-prize contender because:

    1. What constitutes "any human language"?
      • LA street slang?
      • Egyptian hieroglyphics?
      • Chaucer's English?
      • Pidgin?
      • Bill & Ted speak?
      • Clockwork Orange "newspeak"?
      • WW2 Navaho Indian code?
    2. Is this an "arbitrary piece of text"?

      platelayers paratrooper spumoni subversive bala womenfolk zealot wangling gym clout proxemic abravanel entryway assimilates faucets dialup's lamellate apparent propositioning olefin froude.

    3. Neither the goal nor the criteria specify anything about meaning.

      Input: "Mich interessiert das Thema, weil ich fachlich/ beruflich mit Betroffenen zu tun habe."

      Output: "Engaged computing means computational processes that are engaged with the world—think embedded systems."

      Both sentences are (probably) pretty well-formed in their respective languages. The cause of my indecision is that:

      1. I don't speak German, so I can't comment on the first.
      2. My native English skills are far from brilliant. The second seems to make sense, and was probably written by a human being, but whether an English language teacher would find it so is a different matter.

      However, the two sentences (probably) have very little common meaning, as I picked them at random off the net.

    The problem with every definition I've seen of "Natural Language Processing" is that it assumes it is possible to encode not only the syntactic and semantic information contained in a piece of meaningful, correct* text in such a way that all of that information can be embodied in some other language, but also all the meta-information that the human brain divines from auxiliary clues: context; previous awareness of the writer's style; attitudes and prejudices; and a whole lot more besides.

    *How do we deal with almost correct input?

    Even a single word can have many meanings, which the human being can in many cases divine through context. E.g.

    Fire.

    In the absence of any meta-clues, there are at least 3 or 4 possible interpretations of that single word. Chances are that English is the only language in which those 3 or 4 meanings use the same word.
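
    To make the ambiguity concrete, here is the kind of sense table a translator would need for that one word. The German glosses are mine and purely illustrative:

      # Several senses of the single word "fire"; choosing among them
      # needs exactly the context that is missing here.
      my %senses_of_fire = (
          'noun, combustion'   => 'Feuer',      # the house is on fire
          'verb, dismiss'      => 'feuern',     # to fire an employee
          'verb, shoot'        => 'schiessen',  # to fire a weapon
          'exclamation, alarm' => 'Feuer!',     # a shouted warning
      );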

    Then there are phrases like "Oh really". Without context, that can be a genuine enquiry, or pure sarcasm. In many cases, even native English speakers are hard pushed to discern the intended meaning, even with the benefit of hearing the spoken inflection and being party to the context.

    Indeed, whenever the use of language moves beyond the simplest of purely descriptive use, the meaning heard by the listener (or read by the reader) is as much a function of the listener's or reader's experiences, biases and knowledge as it is of the speaker's or writer's.

    How often, even in this place with its fairly constrained focus, do half a dozen readers come away with different interpretations of a writer's words?

    If you translate a document from one language to another, word by word, you usually end up with garbage. If you translate phrase by phrase, you need a huge lookup table of all the billions of possible phrases, and you might end up with something that reads more fluently, but there are two problems.

    1. The mappings of phrase to phrase in each language would need to be done by a human being fluent in both languages (or a pair of native speakers of the two languages who could compare and contrast possible meanings until they arrived at a consensus). This would be a huge undertaking for any single pair of languages; but for all human languages?
    2. Even then, it's not too hard to sit and construct a phrase in English, and a translation of it in a second language, that would be correct in some contexts but utterly wrong in others.

      How do you encapsulate the variability between what the writer intended to write and what the reader read? And more so, the differences in meaning perceived by two or more readers reading the same words? Or by the same reader, reading the same words in two or more different contexts?

    Using the "huge lookup table" method, the magnitude of the problem is not the hardware problem of storing and fast retreival of the translation databases. The problem is of constructing them in the first place.

    The 'other' method of achieving the goal, that of translating all of the syntactic, semantic, contextual, environmental and every other "...al" meaning embodied within natural language into some machine-encodable intermediate language, so that once encoded, the translation to other languages can be done by applying a set of language-specific "construction rules", is even less feasible.

    I think the problem with Natural Language Processing is that, as yet, even the cleverest and most fluent speakers (in any language) have not found a way to use natural language to convey exact meanings, even to others fluent in the same language.

    Until humans can achieve this with a reasonable degree of accuracy and precision, writing a computer program to do it is a non-starter.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
      The goal of an NLP program fluent in any two given languages is to provide the same capabilities that a person fluent in both languages would provide. If a person fluent in English and some other language were asked "Translate the sentence Fire.", that person would ask for context. A properly-written NLP program would do the same thing. The same goes for your list of words grabbed at random from /usr/dict.

      The 'other' method of achieving the goal, that of translating all of the syntactic, semantic, contextual, environmental and every other "...al" meaning embodied within natural language into some machine-encodable intermediate language, so that once encoded, the translation to other languages can be done by applying a set of language-specific "construction rules", is even less feasible.

      Why is that so? I think that the problem is a problem of algorithm and data structure. Most attempts I've seen (and my cousin wrote her masters on the very topic ... in French) attempt to follow standard sentence deconstruction, the kind you learned in English class. I think that this method fails to understand the purpose of language.

      Language, to my mind, is meant to convey concepts. Very fuzzy, un-boxable concepts. But the only intermediate language we have is, well, language. So, we encode, via a very lossy algorithm, into words, phrases, sentences, and paragraphs. Then, the listener decodes via a similarly lossy algorithm (which isn't the same algorithm anyone else would use to decode the same text) into their framework of concepts. Usually, the paradigms are close enough, or the communication is generic enough, that transmission of concepts is possible. However, there are many instances, and I'm sure each of us has run into one, where the concepts we were trying to communicate did not get through. And, as you noted, this is a problem not just between speakers of different languages, but also between fluent speakers of the same language.

      I would like to note that such projects of constructing an intermediate language have succeeded in the past. The most notable example is the Chinese writing system: at least 5 major languages use the exact same writing system, so someone who speaks only Mandarin can communicate just fine with someone who speaks only Cantonese, solely through written communication. There are other examples, but none as wide-reaching. So, it is a feasible idea. And, I think, it's a critical idea. If we can come up with an intermediate language representing the actual concepts being communicated, that would revolutionize philosophy, linguistics, computer science, and a host of other fields. It's not a matter of whether this project is worthwhile; it's a matter of whether we can afford not to do it.
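
      As a toy sketch of what such an intermediate representation might look like, here is one sentence stored as a language-neutral concept structure with per-language construction rules. Every name and rule is invented for illustration; the genuinely hard part is getting real text into and out of the concept structure, not the rendering shown here:

        use strict;
        use warnings;

        # A language-neutral concept structure for one sentence.
        my $concept = { act => 'polite_request', event => 'close', object => 'window' };

        my %lexicon = (
            en => { close => 'close',      window => 'window'  },
            de => { close => 'schliessen', window => 'Fenster' },
        );

        # Per-language "construction rules" that render the concept.
        my %construct = (
            en => sub { my $c = shift;
                "Please $lexicon{en}{$c->{event}} the $lexicon{en}{$c->{object}}." },
            de => sub { my $c = shift;   # German polite requests invert verb and subject
                "Bitte $lexicon{de}{$c->{event}} Sie das $lexicon{de}{$c->{object}}." },
        );

        print $construct{$_}->($concept), "\n" for qw(en de);
        # Please close the window.
        # Bitte schliessen Sie das Fenster.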

      Being right, does not endow the right to be rude; politeness costs nothing.
      Being unknowing, is not the same as being stupid.
      Expressing a contrary opinion, whether to the individual or the group, is more often a sign of deeper thought than of cantankerous belligerence.
      Do not mistake your goals as the only goals; your opinion as the only opinion; your confidence as correctness. Saying you know better is not the same as explaining you know better.

        that person would ask for context. A properly-written NLP program would do the same thing.

        If I sent you a /msg saying "Why do you deprecate tie and not bless?", you would immediately be able to respond to that question.
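
        For anyone without the Perl background that the question assumes: tie and bless are Perl 5 built-ins with no connection to neckwear or religion, which is precisely why the translations below go wrong. A minimal reminder of both:

          use strict;
          use warnings;
          use Tie::Scalar;

          # tie: attach hidden behaviour to an ordinary variable.
          tie my $count, 'Tie::StdScalar';

          # bless: mark a reference as an object of a class.
          my $obj = bless {}, 'My::Class';
          print ref($obj), "\n";   # prints "My::Class"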

        Ask an (existing) NLP to translate that into and back from any other language, or even a host of other languages, and you get:

        1. German: "Warum mißbilligen Sie Riegel und segnen nicht?"

          "Why do you disapprove latch plates and do segnen not?"

        2. French:"Pourquoi est-ce que vous désapprouvez la cravate et ne la bénissezpas?"

          "Why do you disapprove the tie and it bénissezpas?"

        3. Chinese: "?????????????".

          "Why do you belittle the tie and do not bless?"

        4. Japanese: "????????????????"

          "Why, you criticize the tie, don't praise?"

        5. Dutch:"Waarom keurt zegent niet u band af en?"

          "Why inspects don't you bless link finished and?"

        6. Russian: "?????? ?? deprecate ????? ? ?? ???????????????"

          "Why you deprecate connection and do not bless?"

        7. Korean: "?? ? ??? ?????? ???? ????"

          "Tie under criticizing it boils it doesn't bless why it spreads out?"

        That looked damned impressive when I pasted it, and rubbish now I've submitted it :(

        I'm only vaguely fluent in one of those languages, and I would be hard pushed to recognise the question I asked, even though I am completely aware of the context, background and content.

        How can an NLP "ask for context"? Most of the context of that question is completely absent from this post and from this entire thread; some of it depends upon information only conveyed between us (you and me) through private communications. Without having all of the background information, and/or one of us to interrogate, can you ever imagine an NLP being able to communicate the essence of the question I am asking in another language?

        Even a human being, fluent in English and whichever other language(s) we want it translated into, would be extremely hard pushed to convey the essence of that question unless they also had a pretty intimate knowledge of not just programming in general, but Perl 5 specifically. Indeed, the question would be confusing and essentially meaningless, even in English, to anyone without the requisite background.

        And that's my point. Human speech shorthands so much, on the basis of the speaker's knowledge of the listener's knowledge and experience. Try the mental exercise of working out just how much extra information would be required to allow another native English speaker, who has no knowledge of computers or Perl, to understand that question. I seriously doubt it could be done in less than 50,000 words.

        Now imagine trying to translate those 50,000 words into Navaho or Inuit, such that a native speaker of those languages, without computer and Perl 5 experience, could understand them.

        By now, you're probably thinking "But the recipient of such a message would have that experience, otherwise you wouldn't be asking them that question", and you would be right. But it is my contention that if the NLP is going to be able to convey the essence of the words 'tie' and 'bless' in the question, into suitably non-bondage-related and non-religion-related terms in the target language, it would need that same knowledge.

        Of course, you might then say: "If the recipient knows Perl programming, then there is no need, and it would in fact be detrimental, to translate those terms at all". But then the NLP has to have that knowledge in order to know not to translate those two terms. It would also need to 'know' that the recipient had the knowledge to understand the untranslated terms!

        Apply that same logic to conversations between neurosurgeons, or particle physicists, or sushi chefs, or hair-stylists, or mothers.

        Spoken and written language is rife with assumed knowledge and contextual inference. Just as I would have extreme difficulty trying to explain the background of the question to a Japanese sushi chef, he would have extreme difficulty explaining the process of preparing blowfish to me.

        Not only can I see no way to encapsulate all that disparate knowledge in a computer program, I also cannot see how to program the computer to ask the right questions to allow the translation of such information.

        And, I think, it's a critical idea. If we can come up with an intermediate language representing the actual concepts being communicated, that would revolutionize philosophy, linguistics, computer science, and a host of other fields. It's not a matter of whether this project is worthwhile; it's a matter of whether we can afford not to do it.

        I agree with the sentiment of this, but not the approach. Not because I wouldn't like it to succeed, but because I simply do not see the time when this will be feasible. Even with Moore's Law working for us (for how much longer?), I do not see the means by which it would be achievable.

        I also think that the underlying problem will be resolved before we find a technological solution to it, in a rather more prosaic, but ultimately more practical, fashion. I think that, over time, the diversity of human language will steadily reduce until the problem "goes away".

        I suspect that a single common language will become universal. I doubt it will be recognisably any one of the currently existing languages; more a bastardisation of several of the more widely spoken ones all run together. I imagine that there will be a fairly high content of English (because it's the lingua franca in so many fields already), French (because the French are stubborn enough to ensure it; besides which, it's too nice a language to allow to die), Chinese and one of the Indian sub-continental languages (because between them they cover about a third of the world's population), and probably many bits of many others.

        Basically, we (our children('s children?)) will all become bilingual: our 'native' tongues, and "worldspeak".

        Always assuming that we don't run out of fuel, water or air before then!


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: X-Prize: Natural Language Processing
by hardburn (Abbot) on Oct 15, 2004 at 16:09 UTC

    Suggestion: Give a more rigorous testing procedure for the "arbitrarily chosen native speaker". Something like:

    The tester sits in one room with a computer connected to an IRC server in a private room. Two other users are allowed in the IRC room (but only one of them is in it at once), one of which is the program and the other is a second arbitrarily chosen native speaker. After an hour of questioning, the tester will make a guess as to which user is a program and which is a human. The test is repeated with other native speakers (up to some TBD number of tests). To win, the testers must guess incorrectly at least 50% of the time.

    This will probably need to be modified further, but should be a good start. It also adds the requirement that the program can talk over IRC, but I doubt that would be a challenge for anyone implementing a natural-language processor :)
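
    As a sketch of how the pass/fail criterion might be scored, in Perl; the guess data below is invented, and only the 50% threshold comes from the proposal:

      use strict;
      use warnings;

      # 1 = the tester correctly identified which user was the program,
      # 0 = the tester guessed wrong.
      my @correct = (1, 0, 0, 1, 0, 0, 1, 0, 0, 0);

      my $wrong = grep { !$_ } @correct;
      my $rate  = $wrong / @correct;

      printf "Testers guessed wrong %.0f%% of the time: entry %s\n",
          100 * $rate, $rate >= 0.5 ? 'wins' : 'loses';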

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Re: X-Prize: Natural Language Processing
by pmtolk (Acolyte) on Oct 17, 2004 at 19:55 UTC
    I cut and pasted the following from A. Cottrell's research on Indian-Western couples living in India,
    which draws on an excerpt from the book "The Use and Misuse of Language": the article by Edmund S. Glenn entitled "Semantic Difficulties in International Communication". Good and cheap book, worth the read.


    Glenn, in "Semantic Difficulties in International Communication" (also in Hayakawa), argues that the difficulty of transmitting the ideas of one national or cultural group to another is not merely a problem of language, but is more a matter of the philosophy of the individual(s) communicating, which determines how they see things and how they express their ideas. Philosophies or ideas, he feels, are what distinguish one cultural group from another. "...what is meant by (national character) is in reality the embodiment of a philosophy or the habitual use of a method of judging and thinking." (p. 48) "The determination of the relationship between the patterns of thought of the cultural or national group whose ideas are to be communicated, to the patterns of thought of the cultural or national group which is to receive the communication, is an integral part of international communication. Failure to determine such relationships, and to act in accordance with such determinations, will almost unavoidably lead to misunderstandings."

    Glenn gives examples of differences of philosophy in communication misunderstandings among nations, based on UN debates, and also some examples which might be experienced by cross-cultural couples. For example: to the English, "no" means no; to an Arab, "no" means "yes, but let's negotiate or discuss further" (a "real" no has added emphasis); and Indians say no when they mean yes regarding food or hospitality offered.
