in reply to Theory time: Sentence equivalence

Large sets are actually easier to search than you might expect. As each sentence is entered, identify each word's part of speech (I assume you'll be doing that already). Then store a count for each word as each part of speech, along with a list of the sentences it appears in.

I LOVE bread and butter.
LOVE is beautiful.

In the first sentence "love" is a verb; in the second it is a noun (the subject). The two should be indexed separately.
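As a minimal sketch of that index, assuming sentences arrive already POS-tagged (the tag names like "VERB"/"NOUN" here are illustrative, not tied to any particular tagger):

```python
from collections import defaultdict

def build_index(tagged_sentences):
    """Map each (word, pos) pair to its total count and the ids of
    the sentences it appears in."""
    index = defaultdict(lambda: {"count": 0, "sentences": set()})
    for sid, sentence in enumerate(tagged_sentences):
        for word, pos in sentence:
            key = (word.lower(), pos)
            index[key]["count"] += 1
            index[key]["sentences"].add(sid)
    return index

sentences = [
    [("I", "PRON"), ("love", "VERB"), ("bread", "NOUN"),
     ("and", "CONJ"), ("butter", "NOUN")],
    [("Love", "NOUN"), ("is", "VERB"), ("beautiful", "ADJ")],
]
index = build_index(sentences)
# ("love", "VERB") and ("love", "NOUN") are kept as separate keys
```

Keying on the (word, part-of-speech) pair is what keeps the verb "love" and the noun "love" from polluting each other's counts.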

As your sample grows, you should be able to get fairly accurate matches by summing the weight for each word/part-of-speech pair multiplied by the number of times it appears in the sentence. You only need to look at sentences that contain the key words and score above a certain match percentage, which means your heavy-duty algorithm will probably never need to process more than a few dozen candidates, even with hundreds of thousands of sentences stored.
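The lookup step might look like the sketch below. The weighting scheme (log inverse frequency, so rarer word/POS pairs count more) and the 0.5 threshold are illustrative assumptions of mine; the approach above leaves both choices open.

```python
import math
from collections import Counter, defaultdict

def build_index(tagged_sentences):
    """Map each (word, pos) pair to its total count and sentence ids."""
    index = defaultdict(lambda: {"count": 0, "sentences": set()})
    for sid, sentence in enumerate(tagged_sentences):
        for word, pos in sentence:
            key = (word.lower(), pos)
            index[key]["count"] += 1
            index[key]["sentences"].add(sid)
    return index

def score_matches(query, index, tagged_sentences, threshold=0.5):
    """Return (sentence_id, match_pct) pairs for stored sentences that
    score above the threshold against a POS-tagged query sentence."""
    total = sum(entry["count"] for entry in index.values())
    query_counts = Counter((w.lower(), p) for w, p in query)

    # Only sentences sharing at least one key word are ever examined,
    # so the scoring loop below touches a handful of candidates.
    candidates = set()
    for key in query_counts:
        if key in index:
            candidates |= index[key]["sentences"]

    results = []
    for sid in candidates:
        sent_counts = Counter((w.lower(), p) for w, p in tagged_sentences[sid])
        score = max_score = 0.0
        for key, n in query_counts.items():
            # Rarer pairs get higher weight; unseen pairs contribute nothing.
            weight = math.log(total / index[key]["count"]) if key in index else 0.0
            max_score += weight * n
            score += weight * min(n, sent_counts.get(key, 0))
        pct = score / max_score if max_score else 0.0
        if pct >= threshold:
            results.append((sid, pct))
    return sorted(results, key=lambda r: -r[1])

sentences = [
    [("I", "PRON"), ("love", "VERB"), ("bread", "NOUN"),
     ("and", "CONJ"), ("butter", "NOUN")],
    [("Love", "NOUN"), ("is", "VERB"), ("beautiful", "ADJ")],
]
index = build_index(sentences)
matches = score_matches([("love", "NOUN"), ("is", "VERB")], index, sentences)
```

Note that the query "love is" only ever touches sentence 2, because sentence 1's verb "love" lives under a different key.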