Large sets are actually easier to use, from a searching standpoint. As each sentence is entered, identify the part of speech of each word (I assume you'll be doing that already). Then store a count for each word under each part of speech it appears as, along with a list of the sentences that contain it.
I LOVE bread and butter.
LOVE is beautiful.
Love is a verb in the first sentence and a noun (the subject) in the second. The two should be kept separate.
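A minimal sketch of that bookkeeping, in Python for illustration. It assumes the sentences arrive already tagged; the tag names (`VERB`, `NOUN`, and so on) and the `build_index` helper are hypothetical, not from any particular tagger.

```python
from collections import defaultdict

def build_index(tagged_sentences):
    """Map (word, part_of_speech) -> {"count": int, "sentences": set of ids}."""
    index = defaultdict(lambda: {"count": 0, "sentences": set()})
    for sent_id, tagged in enumerate(tagged_sentences):
        for word, pos in tagged:
            # Key on the word AND its part of speech so the two uses stay apart.
            entry = index[(word.lower(), pos)]
            entry["count"] += 1
            entry["sentences"].add(sent_id)
    return index

# Hand-tagged versions of the two example sentences (tags are assumptions).
corpus = [
    [("I", "PRON"), ("love", "VERB"), ("bread", "NOUN"),
     ("and", "CONJ"), ("butter", "NOUN")],
    [("Love", "NOUN"), ("is", "VERB"), ("beautiful", "ADJ")],
]
index = build_index(corpus)
```

Looking up `("love", "VERB")` and `("love", "NOUN")` then gives two separate entries, each pointing at a different sentence.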
As your sample grows, you should be able to get fairly accurate matches by adding up, for each word/part-of-speech pair, its weight multiplied by the number of times the word appears in the sentence. You only need to look at sentences that contain the key words and score a match percentage above a certain level, which means your heavy-duty algorithm will probably never need to examine more than a few dozen sentences, even with a corpus of hundreds of thousands.
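One way the filtering step might look, again as a Python sketch. The post does not say how the weights are chosen, so the inverse-frequency `weight` below is purely an assumption, as are the `candidates` helper, the 0.5 threshold, and the way the score is normalised into a match percentage.

```python
from collections import Counter, defaultdict

def build_index(tagged_sentences):
    # (word, pos) -> Counter of {sentence id: occurrences in that sentence}
    index = defaultdict(Counter)
    for sent_id, tagged in enumerate(tagged_sentences):
        for word, pos in tagged:
            index[(word.lower(), pos)][sent_id] += 1
    return index

def weight(index, key):
    # Assumed weighting: rarer word/part-of-speech pairs count for more.
    total = sum(index[key].values()) if key in index else 0
    return 1.0 / total if total else 0.0

def candidates(index, query, threshold=0.5):
    """Return (sentence id, match fraction) pairs at or above the threshold."""
    wanted = Counter((w.lower(), p) for w, p in query)
    best_possible = sum(weight(index, k) * n for k, n in wanted.items())
    if best_possible == 0:
        return []
    scores = defaultdict(float)
    for key in wanted:
        if key not in index:
            continue
        # Only sentences that contain a key word are ever touched.
        for sent_id, occurrences in index[key].items():
            scores[sent_id] += weight(index, key) * occurrences
    ranked = [(sid, min(1.0, s / best_possible)) for sid, s in scores.items()]
    return sorted((r for r in ranked if r[1] >= threshold), key=lambda r: -r[1])

corpus = [
    [("I", "PRON"), ("love", "VERB"), ("bread", "NOUN"),
     ("and", "CONJ"), ("butter", "NOUN")],
    [("Love", "NOUN"), ("is", "VERB"), ("beautiful", "ADJ")],
]
index = build_index(corpus)
shortlist = candidates(index, [("love", "VERB"), ("bread", "NOUN")])
```

Only the sentences in `shortlist` would be handed to the expensive comparison; everything else is skipped without being examined.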