in reply to Theory time: Sentence equivalence

Pointers to research?

At one level, this is what research in computational linguistics has been working on for years. Before attempting to invent (note that I didn't say reinvent :) this particular wheel by yourself, try googling for "constraint based grammars", "machine translation", "semantic equivalence" or whatever.

Moreover, I think you have a misconception about what constitutes "equivalence". The non-trivial issues of word order and punctuation have been pointed out above. A further problem is that sentences that not only have different grammatical structures but also contain very different words can still be "equivalent". For example:

- In New York, following the latest Fed rate cut, stocks rose across the board.

- The Federal Bank's further lowering of base rates boosted the NYSE and the NASDAQ.

- Wall Street reacted positively after Greenspan reduced interest rates again.

may be considered strictly equivalent, at least for some definition of equivalence, despite containing very few words in common.
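To make that point concrete: any purely lexical notion of equivalence fails on sentences like these. A quick sketch (in Python, purely for illustration, this is not a technique anyone in the thread proposed) computes the Jaccard overlap of word sets for the first two sentences above, and gets a near-zero score despite the sentences being "equivalent" in meaning:

```python
import re

def jaccard(a: str, b: str) -> float:
    """Lexical overlap: shared word types / total word types."""
    ta = set(re.findall(r"[a-z']+", a.lower()))
    tb = set(re.findall(r"[a-z']+", b.lower()))
    return len(ta & tb) / len(ta | tb)

s1 = "In New York, following the latest Fed rate cut, stocks rose across the board."
s2 = "The Federal Bank's further lowering of base rates boosted the NYSE and the NASDAQ."

# The only shared word is "the", so the score is 1/24, about 0.04.
print(round(jaccard(s1, s2), 3))
```

Any serious approach therefore has to work at a semantic level (knowing that "Fed", "Federal Bank", and "Greenspan" can co-refer, that "rate cut" and "lowering of base rates" describe the same event), which is exactly where the hard research problems live.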

Anyway, best of luck in your project. I, for one, would be very interested in seeing your results.

dave

Update: Fixed typos, changed sample sentences slightly to make them more "equivalent".

Replies are listed 'Best First'.
Re^2: Theory time: Sentence equivalence
by eweaver (Sexton) on Dec 18, 2005 at 07:34 UTC

    I don't think you realize what you've gotten into. Research into natural language processing (the official term) is at least 40 years old. Browse on citeulike for NLP, NLG (natural language generation), and NLU (...understanding). Also search Google Scholar.

    The topic is extremely theoretical and relies heavily on Chomsky's theories of grammar, as well as tree theory. There are two main approaches: McKeown's (a name you will see a lot) is usually based on nested templates (IIRC) and requires a massive database of grammatical structures. Other approaches rely on deep semantic representations of the knowledge that is implied by grammatical structures.

    A couple of things that you have failed to consider, each of which has massive amounts of discussion in the literature:
    - Referring expressions (him, her, that, etc.)
    - Focus (what's the subject? Is the sentence really _about_ the subject?)
    - Homonyms (semantic homonyms, at least)
    - Fifty billion other things...

    The dismal state of machine translation ought to indicate how far we have yet to go. Babelfish is rather state of the art, actually.

    Good luck.

    ~e
      On Mondays, I think linguists -- Chomsky, Pinker, the lot of them -- are pseudoscientists peddling bumhug. Kind of like certain bad-apple social scientists and continental philosophers -- see the Sokal Affair.

      On Tuesdays, I think maybe linguists are more like physicists than the wizards Sokal pulled back the curtain on.

      Enh, who knows. But a good place to start on the bad news of solving linguistics problems with computing is Pinker's The Language Instinct. Bumhug he may be, on Mondays, but I liked the book a lot anyway :)

      Babelfish is rather state of the art, actually
      ...mmmm, not really. Babelfish/Systran may be the biggest fish in the pond commercially, but they are hardly state of the art. The big splashes technically are being made by people looking at statistical techniques, such as Language Weaver (the company I work for), ISI, and Google.

      As for the original topic: I will just echo some of the other posters and warn you that you are taking some tiny first (mis?)steps on a journey whose destination is a long way off. This is an intensely complex and interesting topic you could spend your entire life on, if you are sufficiently interested/motivated/compensated.

      --DrWhy

      "If God had meant for us to think for ourselves he would have given us brains. Oh, wait..."