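This may help (it's just a hacked-up example!). It splits the text into sentences, reduces each sentence to a lowercase, letters-only digest, and counts how often each digest occurs.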

#!/usr/bin/perl -w
use strict;

my $text = <<MYTEXT_MYTEXT;
This is my first sentence, Ok? This is another one.
Leonardo da Vinci died at Clos Lucé, France, on 2nd May, 1519.
The only solution I can think of is to loop through the text, word by
word, and search the remaining text for multiple occurences of that word. If
found, check if the successive words are the same.
But that method is very slow, as I need to loop through the content many times.
I'm wondering if there's a way to do it more efficient.
But that method is very slow, as I need to loop through the content many times.
I'm wondering if there's a way to do it more efficient.
But that method is very slow, as I need to loop through the content many times.
I'm wondering if there's a way to do it more efficient.
But that method is very slow, as I need to loop through the content many times.
I'm wondering if there's a way to do it more efficient.
"of the" is a phrase that's apt to recur (and many times) in many documents.
Do you care?
Or do you really mean that the ONLY recurring phrase you care about is
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
And while the "speed" will depend (in part) on your algorithm, the time
the process will take to run to completion will likely be most influenced by the
size of the text to search and the specificity (or simplicity) of the search phrase
(hint: read "regular expression"), for any given language and box upon which to
run it.
MYTEXT_MYTEXT

# flatten newlines and tabs, then split into sentences
$text =~ s/\n|\t/ /sg;
my @phrases = split(/\.|\?|\!/, $text);

# let's allow room for similarity
# by making a digest of each phrase
my %phrases = ();    # phrase => digest
my %digests = ();    # digest => number of occurrences

for (@phrases) {
    my $phrase = $_;
    $phrase =~ /\w/ or next;     # skip fragments with no word characters
    my $digest = lc($phrase);
    $digest =~ s/\W|\s|\d//g;    # drop punctuation, whitespace and digits
    $phrases{$phrase} = $digest;
    $digests{$digest}++;
}

# report every phrase with its digest and that digest's count
my $count = 0;
for (@phrases) {
    my $phrase = $_;
    $phrase =~ /\w/ or next;
    print STDERR "$count) phrase [[[$phrase]]]\ndigest [[[$phrases{$phrase}]]]\n"
        . "digest occurrences: " . $digests{$phrases{$phrase}} . "\n\n";
    $count++;
}

Produces as output:

[leo@mescaline ~]$ perl recurring.pl
0) phrase [[[This is my first sentence, Ok]]]
digest [[[thisismyfirstsentenceok]]]
digest occurrences: 1

1) phrase [[[ This is another one]]]
digest [[[thisisanotherone]]]
digest occurrences: 1

2) phrase [[[ Leonardo da Vinci died at Clos Lucé, France, on 2nd May, 1519]]]
digest [[[leonardodavincidiedatcloslucfranceonndmay]]]
digest occurrences: 1

3) phrase [[[ The only solution I can think of is to loop through the text, word by word, and search the remaining text for multiple occurences of that word]]]
digest [[[theonlysolutionicanthinkofistoloopthroughthetextwordbywordandsearchtheremainingtextformultipleoccurencesofthatword]]]
digest occurrences: 1

4) phrase [[[ If found, check if the successive words are the same]]]
digest [[[iffoundcheckifthesuccessivewordsarethesame]]]
digest occurrences: 1

5) phrase [[[ But that method is very slow, as I need to loop through the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytimes]]]
digest occurrences: 4

6) phrase [[[ I'm wondering if there's a way to do it more efficient]]]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4

7) phrase [[[ But that method is very slow, as I need to loop through the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytimes]]]
digest occurrences: 4

8) phrase [[[ I'm wondering if there's a way to do it more efficient]]]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4

9) phrase [[[ But that method is very slow, as I need to loop through the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytimes]]]
digest occurrences: 4

10) phrase [[[ I'm wondering if there's a way to do it more efficient]]]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4

11) phrase [[[ But that method is very slow, as I need to loop through the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytimes]]]
digest occurrences: 4

12) phrase [[[ I'm wondering if there's a way to do it more efficient]]]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4

13) phrase [[[ "of the" is a phrase that's apt to recur (and many times) in many documents]]]
digest [[[oftheisaphrasethatsapttorecurandmanytimesinmanydocuments]]]
digest occurrences: 1

14) phrase [[[ Do you care]]]
digest [[[doyoucare]]]
digest occurrences: 1

15) phrase [[[ Or do you really mean that the ONLY recurring phrase you care about is "Leonardo da Vinci" or something similarly restricted]]]
digest [[[ordoyoureallymeanthattheonlyrecurringphraseyoucareaboutisleonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 1

16) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

17) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

18) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

19) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

20) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

21) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

22) phrase [[[ And while the "speed" will depend (in part) on your algorithm, the time the process will take to run to completion will likely be most influenced by the size of the text to search and the specificity (or simplicity) of the search phrase (hint: read "regular expression"), for any given language and box upon which to run it]]]
digest [[[andwhilethespeedwilldependinpartonyouralgorithmthetimetheprocesswilltaketoruntocompletionwilllikelybemostinfluencedbythesizeofthetexttosearchandthespecificityorsimplicityofthesearchphrasehintreadregularexpressionforanygivenlanguageandboxuponwhichtorunit]]]
digest occurrences: 1
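
If you only want the phrases that actually recur, rather than a line for every sentence, the same digest idea can be wrapped in a small sub. This is only a rough sketch under my own assumptions: the name recurring_phrases and the sample text are made up for illustration and are not part of the script above. It uses the same digest rule (lowercase, strip non-word characters and digits) and returns each recurring phrase once, in order of first appearance, with its count.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return one
# [ first_seen_phrase, count ] pair per digest that occurs more than once.
sub recurring_phrases {
    my ($text) = @_;
    $text =~ s/\s+/ /g;                    # flatten newlines, tabs, repeated spaces

    my ( %count, %first_seen, @order );
    for my $phrase ( split /[.?!]/, $text ) {
        next unless $phrase =~ /\w/;       # skip fragments with no word characters

        # same digest rule as above: lowercase, drop non-word chars and digits
        ( my $digest = lc $phrase ) =~ s/[\W\d]//g;

        $phrase =~ s/^\s+|\s+$//g;         # trim the phrase for display
        $first_seen{$digest} = $phrase unless exists $first_seen{$digest};
        push @order, $digest unless $count{$digest}++;   # remember first-seen order
    }

    return map  { [ $first_seen{$_}, $count{$_} ] }
           grep { $count{$_} > 1 } @order;
}

# Assumed sample input, just to show the call.
my $sample = "One fish. Two fish! one fish? Red fish. ONE  fish.";
printf "%dx %s\n", $_->[1], $_->[0] for recurring_phrases($sample);
```

With that sample it prints "3x One fish", since the three spellings collapse to the same digest. Keeping only the first-seen spelling is a design choice; you could just as well collect every variant per digest.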

In reply to Re: Finding recurring phrases by leocharre
in thread Finding recurring phrases by Anonymous Monk
