Re: Finding recurring phrases

This may help (it is just a hacked-up example!):

#!/usr/bin/perl -w

use strict;

my $text = <<MYTEXT_MYTEXT;
This is my first sentence, Ok?
This is another one. Leonardo da Vinci died at Clos Lucé, France, on 2nd May, 1519.
The only solution I can think of is to loop through the text, word by word, 
and search the remaining text for multiple occurences of that word. If found, 
check if the successive words are the same.

But that method is very slow, as I need to loop through the content many times.
I'm wondering if there's a way to do it more efficient.
But that method is very slow, as I need to loop through the content many times.
I'm wondering if there's a way to do it more efficient.
But that method is very slow, as I need to loop through the content many times.
I'm wondering if there's a way to do it more efficient.
But that method is very slow, as I need to loop through the content many times.
I'm wondering if there's a way to do it more efficient.
"of the" is a phrase that's apt to recur (and many times) in many documents. 
Do you care? Or do you really mean that the ONLY recurring phrase you care about is 
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?

And while the "speed" will depend (in part) on your algorithm, the time the process 
will take to run to completion will likely be most influenced by the size of the text
to search and the specificity (or simplicity) of the search phrase (hint: read 
"regular expression"), for any given language and box upon which to run it.
MYTEXT_MYTEXT


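# flatten newlines and tabs so that a sentence never spans several lines
$text =~ s/[\n\t]/ /g;

# split the text into sentences on '.', '?' or '!'
my @phrases = split /[.?!]/, $text;
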
# let's allow room for similarity
# by making a digest of each phrase

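my %phrases = ();   # phrase => digest
my %digests = ();   # digest => number of occurrences

for my $phrase (@phrases) {
    $phrase =~ /\w/ or next;    # skip fragments with no word characters

    # digest: lowercased, with non-word characters and digits removed
    # (whitespace is already covered by \W)
    my $digest = lc $phrase;
    $digest =~ s/[\W\d]//g;

    $phrases{$phrase} = $digest;
    $digests{$digest}++;
}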

my $count = 0;
for my $phrase (@phrases) {
    $phrase =~ /\w/ or next;

    # report each phrase, its digest, and how often that digest occurs
    print STDERR "$count) phrase [[[$phrase]]]\n"
        . "digest [[[$phrases{$phrase}]]]\n"
        . "digest occurrences: " . $digests{ $phrases{$phrase} } . "\n\n";
    $count++;
}

This produces the following output (on STDERR):

[leo@mescaline ~]$ perl recurring.pl
0) phrase [[[This is my first sentence, Ok]]]
digest [[[thisismyfirstsentenceok]]]
digest occurrences: 1

1) phrase [[[ This is another one]]]
digest [[[thisisanotherone]]]
digest occurrences: 1

2) phrase [[[ Leonardo da Vinci died at Clos Lucé, France, on 2nd May, 1519]]]
digest [[[leonardodavincidiedatcloslucfranceonndmay]]]
digest occurrences: 1

3) phrase [[[ The only solution I can think of is to loop through the text, word by word,  and search the remaining text for multiple occurences of that word]]]
digest [[[theonlysolutionicanthinkofistoloopthroughthetextwordbywordandsearchtheremainingtextformultipleoccurencesofthatword]]]
digest occurrences: 1

4) phrase [[[ If found,  check if the successive words are the same]]]
digest [[[iffoundcheckifthesuccessivewordsarethesame]]]
digest occurrences: 1

5) phrase [[[  But that method is very slow, as I need to loop through the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytimes]]]
digest occurrences: 4

6) phrase [[[ I'm wondering if there's a way to do it more efficient]]]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4

7) phrase [[[ But that method is very slow, as I need to loop through the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytimes]]]
digest occurrences: 4

8) phrase [[[ I'm wondering if there's a way to do it more efficient]]]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4

9) phrase [[[ But that method is very slow, as I need to loop through the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytimes]]]
digest occurrences: 4

10) phrase [[[ I'm wondering if there's a way to do it more efficient]]]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4

11) phrase [[[ But that method is very slow, as I need to loop through the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytimes]]]
digest occurrences: 4

12) phrase [[[ I'm wondering if there's a way to do it more efficient]]]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4

13) phrase [[[ "of the" is a phrase that's apt to recur (and many times) in many documents]]]
digest [[[oftheisaphrasethatsapttorecurandmanytimesinmanydocuments]]]
digest occurrences: 1

14) phrase [[[  Do you care]]]
digest [[[doyoucare]]]
digest occurrences: 1

15) phrase [[[ Or do you really mean that the ONLY recurring phrase you care about is  "Leonardo da Vinci" or something similarly restricted]]]
digest [[[ordoyoureallymeanthattheonlyrecurringphraseyoucareaboutisleonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 1

16) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

17) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

18) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

19) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

20) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

21) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6

22) phrase [[[  And while the "speed" will depend (in part) on your algorithm, the time the process  will take to run to completion will likely be most influenced by the size of the text to search and the specificity (or simplicity) of the search phrase (hint: read  "regular expression"), for any given language and box upon which to run it]]]
digest [[[andwhilethespeedwilldependinpartonyouralgorithmthetimetheprocesswilltaketoruntocompletionwilllikelybemostinfluencedbythesizeofthetexttosearchandthespecificityorsimplicityofthesearchphrasehintreadregularexpressionforanygivenlanguageandboxuponwhichtorunit]]]
digest occurrences: 1
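
To make the "room for similarity" idea easier to see in isolation, here is a rough sketch of the digest step pulled out into its own sub. The name phrase_digest and the sample strings are made up for illustration; they are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# phrase_digest is a hypothetical helper name (an assumption for this sketch).
# It lowercases the phrase and strips non-word characters and digits,
# the same normalisation the script above applies inline.
sub phrase_digest {
    my ($phrase) = @_;
    my $digest = lc $phrase;
    $digest =~ s/[\W\d]//g;
    return $digest;
}

# made-up sample phrases: the first two differ only in case,
# spacing and punctuation; the third differs by one word
my @samples = (
    'But that method is very slow',
    '  but that METHOD is very slow, ',
    'But that method is quite slow',
);

printf "%-36s => %s\n", "[$_]", phrase_digest($_) for @samples;
```

The first two samples should collapse to the same digest, while the third, which differs by a word, should not. So the digest forgives case, spacing and punctuation, but it is not a fuzzy match.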

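
If you only want the phrases that actually recur, rather than a line for every sentence, something along these lines might do. It is a sketch only: recurring_phrases is a made-up name, the default threshold of 2 is my assumption, and the sample text is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# recurring_phrases is a hypothetical helper (an assumption for this sketch).
# It returns [phrase, count] pairs for every sentence whose digest occurs
# at least $min times, most frequent first.
sub recurring_phrases {
    my ($text, $min) = @_;
    $min = 2 unless defined $min;    # assumed default threshold

    $text =~ tr/\n\t/ /;             # flatten newlines and tabs

    my (%count, %first);
    for my $phrase (split /[.?!]/, $text) {
        next unless $phrase =~ /\w/;

        # same digest as before: lowercase, no non-word chars, no digits
        (my $digest = lc $phrase) =~ s/[\W\d]//g;
        $count{$digest}++;

        # remember the first spelling seen for each digest, trimmed
        unless (exists $first{$digest}) {
            (my $clean = $phrase) =~ s/^\s+|\s+$//g;
            $first{$digest} = $clean;
        }
    }

    return map  { [ $first{$_}, $count{$_} ] }
           sort { $count{$b} <=> $count{$a} || $first{$a} cmp $first{$b} }
           grep { $count{$_} >= $min } keys %count;
}

# invented sample text
my $sample = "One fish. Two fish! one fish? Red fish. ONE FISH.";
printf "%d x %s\n", $_->[1], $_->[0] for recurring_phrases($sample);
```

Note that this still works on whole sentences only: a phrase such as "Leonardo da Vinci" embedded in otherwise different sentences would not be counted.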