Two thoughts come to mind.

One, how similar must two strings be (a proposed paragraph and an existing paragraph) to be considered identical?

Now is the time for all good Men to come to the aid of their Country.
Now is the time, for all good men, to come to the aid of their country!

Two, once you have a canonical paragraph, find a good way to hash or digest the paragraph to something that is quick and easy to compare later. For example, Digest::SHA1 or Digest::MD5. The digests are fast to compare, and you can even fit them into an in-memory hashtable or save them to a separate file.
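A minimal sketch of the digest idea, assuming the paragraphs have already been reduced to a canonical form. It uses the core Digest::MD5 module (Digest::SHA1 would look the same with sha1_hex). The %seen hash, the already_present sub, and the sample paragraphs are invented for illustration.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical, already-canonical paragraphs taken from the existing file.
my @existing = (
    'now is the time for all good men to come to the aid of their country',
    'the quick brown fox jumps over the lazy dog',
);

# One 32-character hex digest per paragraph, used as the hash key.
my %seen = map { md5_hex($_) => 1 } @existing;

# True if a canonical paragraph has the same digest as a stored one.
sub already_present {
    my ($paragraph) = @_;
    return exists $seen{ md5_hex($paragraph) };
}
```

Note that md5_hex expects bytes, so text containing characters above 0xFF would need to be encoded (for example with Encode's encode_utf8) before digesting.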

Remember to boil the paragraph down to the most canonical form possible, so that paragraphs that differ only in irrelevant ways are not treated as distinct. Some examples of this might include collapsing runs of whitespace to single spaces, lowercasing everything, and removing diacritical marks or some forms of punctuation. The resulting string is no longer fit to display, but it is ready to hash or digest.
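One way the boiling-down might look, as a rough sketch rather than a definitive rule set. The canonicalize sub is a made-up name, it assumes its input is a decoded character string, and which punctuation to drop is a judgment call for your data.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Reduce a paragraph to a form meant only for comparison, not display.
sub canonicalize {
    my ($text) = @_;
    $text = NFD($text);           # split accented letters into base + combining mark
    $text =~ s/\p{Mn}//g;         # drop the combining marks (diacritics)
    $text = lc $text;             # ignore case
    $text =~ s/[[:punct:]]//g;    # drop punctuation
    $text =~ s/\s+/ /g;           # collapse whitespace runs to one space
    $text =~ s/^ | $//g;          # trim a leading or trailing space
    return $text;
}

my $first  = canonicalize('Now is the time for all good Men to come to the aid of their Country.');
my $second = canonicalize('Now is the time, for all good men, to come to the aid of their country!');

print "same\n" if $first eq $second;    # prints "same"
```

Deleting punctuation outright also joins hyphenated words ("well-known" becomes "wellknown"); replacing it with a space instead is an equally reasonable choice, as long as every paragraph goes through the same rules.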

--
[ e d @ h a l l e y . c c ]


In reply to Re: Efficiency: Finding if a file contains a paragraph by halley
in thread Efficiency: Finding if a file contains a paragraph by C_T
