One, how similar must two strings (a proposed paragraph and an existing paragraph) be to be considered identical?
Now is the time for all good Men to come to the aid of their Country.
Now is the time, for all good men, to come to the aid of their country!
Two, once you have a canonical paragraph, hash or digest it down to something that is quick and easy to compare later, for example with Digest::SHA1 or Digest::MD5. The digests are fast to compare, and they are small enough to keep in an in-memory hash table or save to a separate file.
Remember to boil down the paragraph to the most canonical form possible, so that two paragraphs that differ only in irrelevant ways don't end up with different digests and get treated as distinct. Some examples of this might include collapsing runs of whitespace to single spaces, lowercasing everything, and removing diacritical marks or some forms of punctuation. The resulting string is not fit to display anymore, but it is ready to hash or digest.
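Here is a rough sketch of the canonicalize-then-digest idea, just to make it concrete. It is written in Python, with hashlib.sha1 standing in for Digest::SHA1; the function names canonicalize and paragraph_digest are made up for illustration, and the exact normalization rules (what counts as irrelevant punctuation, for instance) are assumptions you would tune for your own data.

```python
import hashlib
import re
import unicodedata


def canonicalize(paragraph: str) -> str:
    """Reduce a paragraph to a canonical form (hypothetical rules)."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", paragraph)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    lowered = no_marks.lower()
    # Assumption: all punctuation is irrelevant, so strip anything that
    # is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", "", lowered)
    # Collapse runs of whitespace to a single space.
    return re.sub(r"\s+", " ", no_punct).strip()


def paragraph_digest(paragraph: str) -> str:
    """SHA-1 hex digest of the canonical form of a paragraph."""
    return hashlib.sha1(canonicalize(paragraph).encode("utf-8")).hexdigest()


first = "Now is the time for all good Men to come to the aid of their Country."
second = "Now is the time, for all good men, to come to the aid of their country!"

# Digests of paragraphs already in the file, kept in an in-memory set.
seen = {paragraph_digest(first)}

# The second paragraph differs only in case and punctuation, so its
# digest is already in the set.
print(paragraph_digest(second) in seen)  # prints True
```

With these rules, the two example sentences above collapse to the same canonical string, so a single set lookup answers "have I seen this paragraph before?". If you decide that commas or capitalization do matter, loosen canonicalize accordingly and the digests will diverge again.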
--
[ e d @ h a l l e y . c c ]
In reply to Re: Efficiency: Finding if a file contains a paragraph
by halley
in thread Efficiency: Finding if a file contains a paragraph
by C_T