I'm looking for a way to find all recurring phrases in some documents.
Example: given the text below, I want to detect the phrase "Leonardo da Vinci" because it appears twice:
Leonard of Quirm, a character in the Discworld series of novels,
is based largely on Leonardo da Vinci.
Leonardo da Vinci died at Clos Lucé, France, on 2nd May, 1519.
The only solution I can think of is to loop through the text word by word and search the remaining text for further occurrences of that word. If one is found, check whether the words that follow are also the same, and so on...
But that method is very slow, as I need to loop through the content many times.
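To make the approach concrete, here is a rough sketch of the brute-force idea described above. It is written in Python purely for illustration; the function name `find_repeated_phrases`, the `min_words` parameter, and the choice to split words with the regex `\w+` are all assumptions of this sketch, not part of any existing solution.

```python
import re

def find_repeated_phrases(text, min_words=2):
    """Brute-force search: for each word, scan the rest of the text for the
    same word, then extend the match while the following words also agree."""
    words = re.findall(r"\w+", text)  # assumption: a "word" is a run of \w characters
    n = len(words)
    found = set()
    for i in range(n):
        for j in range(i + 1, n):
            if words[i] != words[j]:
                continue
            # Extend the match word by word; stop before the two
            # occurrences would overlap (i + length must stay below j).
            length = 0
            while (j + length < n and i + length < j
                   and words[i + length] == words[j + length]):
                length += 1
            if length >= min_words:
                found.add(" ".join(words[i:i + length]))
    return found

sample = (
    "Leonard of Quirm, a character in the Discworld series of novels, "
    "is based largely on Leonardo da Vinci. "
    "Leonardo da Vinci died at Clos Lucé, France, on 2nd May, 1519."
)
phrases = find_repeated_phrases(sample)
print(sorted(phrases))
```

On the sample text this reports "Leonardo da Vinci", and also the shorter sub-phrase "da Vinci", since every repeated word is used as a starting point. The two nested loops over word positions are what make it slow: the work grows roughly with the square of the number of words.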
I'm wondering if there's a way to do it more efficiently...
Len