I would like to be able to somehow compare the strings in such a way that the fact that the quote is already entered shows up, despite the fact that it's not exactly the same, either shorter or longer or slightly differently written in some aspects.
I've seen a simple approach to a similar problem. It went something like:
- Convert each string to an array of words.
- Delete "noise" words ("a", "the", "to", etc.) from each array.
- Convert each remaining word into a "canonical" form (e.g., all lower-case, apostrophes deleted).
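- Compute a score based on the Hamming distance between the source array and the target array. (Hamming distance is a measure of how close two strings are to each other. Strictly, it counts the positions at which two equal-length sequences differ; the form* that also considers transpositions and deletions is usually called edit distance, e.g. Levenshtein or Damerau-Levenshtein distance.)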
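Here is a rough Python sketch of those steps, not a definitive implementation. The noise-word list and the function names (`canonical_words`, `word_edit_distance`, `similarity`) are made up for illustration. It also swaps in a word-level Levenshtein edit distance rather than a strict Hamming distance, since Hamming distance is only defined for sequences of equal length and the two quotes will usually differ in length.

```python
import re

# Assumed noise-word list; a real one would be longer and tuned to the data.
NOISE_WORDS = {"a", "an", "the", "to", "of", "and", "is", "in"}

def canonical_words(text):
    """Lower-case the text, delete apostrophes, split into words, drop noise words."""
    text = text.lower().replace("'", "")
    words = re.findall(r"[a-z0-9]+", text)
    return [w for w in words if w not in NOISE_WORDS]

def word_edit_distance(source, target):
    """Levenshtein distance counted in whole words (insert, delete, substitute)."""
    previous = list(range(len(target) + 1))
    for i, s_word in enumerate(source, start=1):
        current = [i]
        for j, t_word in enumerate(target, start=1):
            cost = 0 if s_word == t_word else 1
            current.append(min(previous[j] + 1,          # delete s_word
                               current[j - 1] + 1,       # insert t_word
                               previous[j - 1] + cost))  # keep or substitute
        previous = current
    return previous[-1]

def similarity(a, b):
    """Score from 0.0 (nothing in common) to 1.0 (same canonical words, same order)."""
    wa, wb = canonical_words(a), canonical_words(b)
    longest = max(len(wa), len(wb))
    if longest == 0:
        return 1.0
    return 1.0 - word_edit_distance(wa, wb) / longest
```

You would then call `similarity(new_quote, stored_quote)` against each stored quote and flag anything above a threshold you pick by experiment.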
The two tricky parts are settling on a canonical form, which can get quite involved if you want to consider stemming, and the Hamming distance calculation.
Try the simple approach first: convert all words to lower-case, drop noise words, then see how close the two arrays are in terms of common elements in a common order.
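For the simple approach, one way to measure "common elements in a common order" in Python is the standard library's `difflib.SequenceMatcher`, which works on lists of words as well as on strings. This is only a sketch under assumptions: the noise-word list and the names `simple_words` and `common_order_score` are invented for the example, and the only punctuation handling is stripping it from the ends of each word.

```python
import string
from difflib import SequenceMatcher

# Assumed noise-word list, for illustration only.
NOISE_WORDS = {"a", "an", "the", "to", "of", "and"}

def simple_words(text):
    """Lower-case, split on whitespace, trim punctuation, drop noise words."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w and w not in NOISE_WORDS]

def common_order_score(a, b):
    """Ratio in [0, 1] based on blocks of words that match in the same order."""
    wa, wb = simple_words(a), simple_words(b)
    return SequenceMatcher(None, wa, wb, autojunk=False).ratio()
```

The ratio is twice the number of matched words divided by the total number of words in both lists, so a quote that is a shortened version of a stored one still scores fairly high.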
----
*I did a quick Google search. Most descriptions talk in terms of Hamming's original definition, which was in terms of bit flips. I've seen this applied to strings of arbitrary data somewhere -- perhaps in an Algorithms text. The string comparison that LimeWire uses to guess at whether a search string matches an MP3 title might also be worth checking, though it's in Java.