I would like to be able to somehow compare the strings in such a way that the fact that the quote is already entered shows up, despite the fact that it's not exactly the same, either shorter or longer or slightly differently written in some aspects.
I've seen a simple approach to a similar problem. It went something like:
- Convert each string to an array of words.
- Delete "noise" words ("a", "the", "to", etc.) from each array.
- Convert each remaining word into a "canonical" form (e.g., all lower-case, apostrophes deleted).
- Compute a score based on the Hamming distance between the source array and the target array. (Hamming distance is a measure of how close two strings are to each other. In one form*, it considers transpositions and deletions; that variant is more commonly called edit distance.)
The two tricky parts are settling on a canonical form, which can get quite involved if you want to consider stemming, and the Hamming distance calculation.
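For the distance calculation, here is a minimal sketch in Python that treats each array of words as a sequence and counts word-level insertions, deletions and substitutions. This is plain Levenshtein edit distance rather than Hamming's original definition, and it does not give transpositions special treatment (a swapped pair of words costs 2). The function names `word_edit_distance` and `similarity` are made up for illustration.

```python
def word_edit_distance(source, target):
    """Levenshtein distance between two sequences of words."""
    # prev[j] is the distance between the source words consumed so far
    # and the first j words of target.
    prev = list(range(len(target) + 1))
    for i, s_word in enumerate(source, start=1):
        curr = [i]  # turning i source words into nothing takes i deletions
        for j, t_word in enumerate(target, start=1):
            cost = 0 if s_word == t_word else 1
            curr.append(min(prev[j] + 1,          # delete s_word
                            curr[j - 1] + 1,      # insert t_word
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def similarity(source, target):
    """Scale the distance to 0.0 (nothing alike) .. 1.0 (identical)."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0  # assumption: two empty arrays count as identical
    return 1.0 - word_edit_distance(source, target) / longest
```

Dividing by the longer length is only one way to normalise; it penalises a quote that is a shortened version of another, which may or may not be what you want.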
Try the simple approach first: convert all words to lower-case, drop noise words, then see how close the two arrays are in terms of common elements in a common order.
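A rough sketch of that simple approach, in Python. The noise-word list, the ASCII-only clean-up and the function names are all assumptions for illustration; "common elements in a common order" is interpreted here as the longest common subsequence of the two word arrays.

```python
import re

# Hypothetical noise-word list; tune it for your own quotes.
NOISE_WORDS = {"a", "an", "the", "to", "of", "and"}


def canonical_words(text):
    """Lower-case, strip punctuation and apostrophes, drop noise words."""
    words = []
    for raw in text.lower().split():
        word = re.sub(r"[^a-z0-9]", "", raw)  # assumption: ASCII text only
        if word and word not in NOISE_WORDS:
            words.append(word)
    return words


def common_order_length(a, b):
    """Length of the longest common subsequence of two word lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def quote_score(first, second):
    """Fraction of the shorter quote's words found, in order, in the other."""
    a, b = canonical_words(first), canonical_words(second)
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    return common_order_length(a, b) / shorter
```

Dividing by the shorter array means a truncated copy of an existing quote still scores 1.0, which suits the "shorter or longer" case in the question. Where to put the threshold for "already entered" is something you would have to find by trying it on real data.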
----
*I did a quick Google search. Most descriptions talk in terms of Hamming's original definition, which was in terms of bit flips. I've seen this applied to strings of arbitrary data somewhere -- perhaps in an Algorithms text. The string comparison that LimeWire uses to guess at whether a search string matches an MP3 title might also be worth checking, though it's in Java.
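For reference, a small Python sketch of the original definition: the number of positions at which two equal-length sequences differ. The function names are made up, and `hamming_bits` assumes non-negative integers.

```python
def hamming_bits(x, y):
    """Number of bit positions in which two non-negative integers differ."""
    return bin(x ^ y).count("1")


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(1 for p, q in zip(a, b) if p != q)
```

Because it is only defined for equal lengths, this form cannot by itself cope with a quote that is shorter or longer than the stored one.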