Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for a way to find all recurring phrases in some documents. Example: Given this text I want to detect the phrase "Leonardo da Vinci" as it appears twice:
Leonard of Quirm, a character in the Discworld series of novels, is based largely on Leonardo da Vinci. Leonardo da Vinci died at Clos Lucé, France, on 2nd May, 1519.
The only solution I can think of is to loop through the text, word by word, and search the remaining text for multiple occurences of that word. If found, check if the successive words are the same, and so on...

But that method is very slow, as I need to loop through the content many times.
I'm wondering if there's a way to do it more efficient...

Len

Replies are listed 'Best First'.
Re: Finding recurring phrases
by diotalevi (Canon) on May 16, 2006 at 18:55 UTC

    Check this out. It finds the reoccuring phrases "of", "on", "Leonardo da Vinci", "da Vinci", and "Vinci".

    $_ = <<"..."; Leonard of Quirm, a character in the Discworld series of novels, is based largely on Leonardo da Vinci. Leonardo da Vinci died at Clos Lucé, France, on 2nd May, 1519. ... # Normalize the whitespace s/\s+/ /g; my $RX = qr/ # NODE EXPLANATION # -------------------------------------------------------------------- +-- \b # the boundary between a word char (\w) and # something that is not a word char # -------------------------------------------------------------------- +-- ( # group and capture to \1: # -------------------------------------------------------------------- +-- \w+ # word characters (a-z, A-Z, 0-9, _) (1 o +r # more times (matching the most amount # possible)) # -------------------------------------------------------------------- +-- (?: # group, but do not capture (0 or more # times (matching the most amount # possible)): # -------------------------------------------------------------------- +-- \s+ # whitespace (\n, \r, \t, \f, and " ") # (1 or more times (matching the most # amount possible)) # -------------------------------------------------------------------- +-- \w+ # word characters (a-z, A-Z, 0-9, _) (1 # or more times (matching the most # amount possible)) # -------------------------------------------------------------------- +-- )* # end of grouping # -------------------------------------------------------------------- +-- ) # end of \1 # -------------------------------------------------------------------- +-- \b # the boundary between a word char (\w) and # something that is not a word char # -------------------------------------------------------------------- +-- .+? # any character except \n (1 or more times # (matching the least amount possible)) # -------------------------------------------------------------------- +-- \b # the boundary between a word char (\w) and # something that is not a word char # -------------------------------------------------------------------- +-- \1 # what was matched by capture \1 # -------------------------------------------------------------------- +-- \b # the boundary between a word char (\w) and # something that is not a word char /xms; while ( /$RX/sg ) { pos() = $-[0] + 1; print "<$1>\n"; }

    ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

      You may be interested to know that I just tried this with 100 paragraphs of the lorem ipsum text (see http://www.lipsum.com/) and my perl segfaulted immediately. I don't know how big the OP's source text is, but it can't be too big or perl falls down.

        Use a perl without a stack based regex engine and that limitation is removed.

        ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

      Wow... That's some real regex magic :) Performance is not bad.. Took 20sec. to find all recurring phrases in 100 paragraphs of lorem ipsum.

      Is there a way to count the number of occurences ?

Re: Finding recurring phrases
by salva (Canon) on May 16, 2006 at 19:18 UTC
    a solution with O(NlogN) cost and with low memory consumption:
    use strict; use warnings; my $data = do { local $/; lc <DATA> }; my @words = $data =~ /\w+/g; { no warnings; sub cmpix { for (my $off = 0;; $off++) { my $cmp = $words[$a+$off] cmp $words[$b+$off]; return $cmp if $cmp; } } } my @ixs = sort cmpix 0..$#words; # replace this code with duplication detector (it # should be easy!): for my $ix (@ixs) { print substr(join(" ", @words[$ix..$#words]), 0, 50), "\n"; } __DATA__ ...
    that outputs...
    $ perl ./t.pl 1519 2nd may 1519 a character in the discworld series of novels is b at clos luc france on 2nd may 1519 based largely on leonardo da vinci leonardo da vin character in the discworld series of novels is bas clos luc france on 2nd may 1519 da vinci died at clos luc france on 2nd may 1519 da vinci leonardo da vinci died at clos luc france died at clos luc france on 2nd may 1519 discworld series of novels is based largely on leo france on 2nd may 1519 in the discworld series of novels is based largely is based largely on leonardo da vinci leonardo da largely on leonardo da vinci leonardo da vinci die leonard of quirm a character in the discworld seri leonardo da vinci died at clos luc france on 2nd m leonardo da vinci leonardo da vinci died at clos l luc france on 2nd may 1519 may 1519 novels is based largely on leonardo da vinci leona of novels is based largely on leonardo da vinci le of quirm a character in the discworld series of no on 2nd may 1519 on leonardo da vinci leonardo da vinci died at clo quirm a character in the discworld series of novel series of novels is based largely on leonardo da v the discworld series of novels is based largely on vinci died at clos luc france on 2nd may 1519 vinci leonardo da vinci died at clos luc france on
    then finding duplicates is pretty obvious as they appear in consecutive entries in @ixs.
      Brilliant ! Thanks salva. I'm going to use this solution. It's a clean and fast !
Re: Finding recurring phrases
by GrandFather (Saint) on May 16, 2006 at 19:16 UTC

    You may be interested in Fast common substring matching. It's intended for a very different application domain, but may suit the search you wish to perform ("find all recurring phrases").


    DWIM is Perl's answer to Gödel
Re: Finding recurring phrases
by TedPride (Priest) on May 16, 2006 at 23:01 UTC
    Not so efficient, but output divided by words and counted. Sorting substrings alone is easy.
    use strict; use warnings; my $words = 1; # Phrase must contain at least 1 word my $size = 5; # Phrase must be at least 5 characters long my $matches = 2; # Must be at least 2 copies of phrase my (@words, @pos, @matches, $p, $c, $key); $_ = join '', <DATA>; $_ = lc($_); @words = m/\w+(?:'\w+)?/g; @pos = sort { mycmp($a, $b) } 0..$#words; for $p ($words..$#pos) { $c = mycount($p); for (1..$c) { $key = join ' ', @words[$pos[$p]..($pos[$p]+$_-1)]; next if length($key) < $size; $matches[$_]{$key}++; } } for (reverse $words..$#matches) { print "$_ words:\n"; $c = $matches[$_]; for (sort { $c->{$b} <=> $c->{$a} } keys %$c) { last if $c->{$_} < $matches - 1; print " $_ : ",($c->{$_}+1),"\n"; } } sub mycount { my $x = $pos[$_[0]]; my $y = $pos[$_[0]-1]; my $c = 0; $c++ while $x <= $#words && $y <= $#words && $words[$x++] eq $word +s[$y++]; return $c; } sub mycmp { my ($x, $y) = @_; while ($x <= $#words && $y <= $#words) { return $c if $c = $words[$x++] cmp $words[$y++]; } return $x <=> $y; } __DATA__ Section. 1. All legislative Powers herein granted shall be vested in a Congress of + the United States, which shall consist of a Senate and House of Repr +esentatives. Section. 2. The House of Representatives shall be composed of Members chosen every + second Year by the People of the several States, and the Electors in + each State shall have the Qualifications requisite for Electors of t +he most numerous Branch of the State Legislature. No Person shall be a Representative who shall not have attained to the + Age of twenty five Years, and been seven Years a Citizen of the Unit +ed States, and who shall not, when elected, be an Inhabitant of that +State in which he shall be chosen.
    I suppose I could write this to use a linear counting method of some sort, but that's for another day.
Re: Finding recurring phrases
by ww (Archbishop) on May 16, 2006 at 18:45 UTC

    Standard Reply:

    What have you tried? where's your code?

    Non-standard caveat:

    How do you intend to define "recurring phrases?"

    "of the" is a phrase that's apt to recur (and many times) in many documents. Do you care? Or do you really mean that the ONLY recurring phrase you care about is "Leonardo da Vinci" or something similarly restricted?

    And while the "speed" will depend (in part) on your algorithm, the time the process will take to run to completion will likely be most influenced by the size of the text to search and the specificity (or simplicity) of the search phrase (hint: read "regular expression"), for any given language and box upon which to run it.

    So, please, rethink your question, a bit, CORRECTION, duh! and update it  (anonymonk can't update) add info as new comment to provide additional detail.

    pertinent update from anonymonk! --\v

      "of the" need to be found, but will be filtered out afterwards by a rule that prevents phrases to end with a stopword.

      What I'm currently doing is this (for 2 word phrases):

      sub add_content { my $self = shift; my $content = shift; $words = [ split(/\s+/, $content) ]; for ($i=0; $i < scalar(@$words) ; $i++) { my $first_word = lc($words->[$i]); my $second_word = lc($words->[$i+1]); # 2 word phrases if ($self->is_relevant_word($first_word , $second_word +) && $first_word ne "$second_word") { my $phrase = $first_word . " " . $second_word; $self->{_related}{$phrase}++; $self->_rate_phrase($phrase); } } }
      I'm just counting the occurences of phrases like this: $hash{$phrase}++ , and afterwards look for hash elements with values > 1.
Re: Finding recurring phrases
by planetscape (Chancellor) on May 17, 2006 at 10:32 UTC

    Depending on where you're going with all this, you may want to Super Search for "ngrams" (I like Ted Pedersen's Ngram Statistics Package), "concordance" or "KWIC" (key word in context). My favorite KWIC is collocate, but there are several versions here on PM as well.

    HTH,

    planetscape
Re: Finding recurring phrases
by leocharre (Priest) on May 16, 2006 at 19:17 UTC

    this may help .. (this is a just a hacked up example!!! )

    #!/usr/bin/perl -w use strict; my $text = <<MYTEXT_MYTEXT; This is my first sentence, Ok? This is another one. Leonardo da Vinci died at Clos Lucé, France, on 2 +nd May, 1519. The only solution I can think of is to loop through the text, word by +word, and search the remaining text for multiple occurences of that word. If + found, check if the successive words are the same. But that method is very slow, as I need to loop through the content ma +ny times. I'm wondering if there's a way to do it more efficient. But that method is very slow, as I need to loop through the content ma +ny times. I'm wondering if there's a way to do it more efficient. But that method is very slow, as I need to loop through the content ma +ny times. I'm wondering if there's a way to do it more efficient. But that method is very slow, as I need to loop through the content ma +ny times. I'm wondering if there's a way to do it more efficient. "of the" is a phrase that's apt to recur (and many times) in many docu +ments. Do you care? Or do you really mean that the ONLY recurring phrase you +care about is "Leonardo da Vinci" or something similarly restricted? "Leonardo da Vinci" or something similarly restricted? "Leonardo da Vinci" or something similarly restricted? "Leonardo da Vinci" or something similarly restricted? "Leonardo da Vinci" or something similarly restricted? "Leonardo da Vinci" or something similarly restricted? "Leonardo da Vinci" or something similarly restricted? And while the "speed" will depend (in part) on your algorithm, the tim +e the process will take to run to completion will likely be most influenced by the s +ize of the text to search and the specificity (or simplicity) of the search phrase (hi +nt: read "regular expression"), for any given language and box upon which to ru +n it. MYTEXT_MYTEXT $text=~s/\n|\t/ /sg; my @phrases = split(/\.|\?|\!/,$text); # let's allow room for similarity # by making a digest of each phrase my %phrases=(); my %digests=(); for ( @phrases ){ my $phrase=$_; $phrase=~/\w/ or next; my $digest=lc($phrase); $digest=~s/\W|\s|\d//g; $phrases{$phrase}=$digest; $digests{$digest}++; } my $count =0; for (@phrases){ my $phrase=$_; $phrase=~/\w/ or next; print STDERR "$count) phrase [[[$phrase]]]\ndigest [[[$phrases{$ph +rase}]]]\n" ."digest occurrences: ".$digests{$phrases{$phrase}}."\n\n"; $count++; }

    Produces as output:

    [leo@mescaline ~]$ perl recurring.pl 0) phrase [[[This is my first sentence, Ok]]] digest [[[thisismyfirstsentenceok]]] digest occurrences: 1 1) phrase [[[ This is another one]]] digest [[[thisisanotherone]]] digest occurrences: 1 2) phrase [[[ Leonardo da Vinci died at Clos Lucé, France, on 2nd May, + 1519]]] digest [[[leonardodavincidiedatcloslucfranceonndmay]]] digest occurrences: 1 3) phrase [[[ The only solution I can think of is to loop through the +text, word by word, and search the remaining text for multiple occur +ences of that word]]] digest [[[theonlysolutionicanthinkofistoloopthroughthetextwordbywordan +dsearchtheremainingtextformultipleoccurencesofthatword]]] digest occurrences: 1 4) phrase [[[ If found, check if the successive words are the same]]] digest [[[iffoundcheckifthesuccessivewordsarethesame]]] digest occurrences: 1 5) phrase [[[ But that method is very slow, as I need to loop through + the content many times]]] digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytim +es]]] digest occurrences: 4 6) phrase [[[ I'm wondering if there's a way to do it more efficient]] +] digest [[[imwonderingiftheresawaytodoitmoreefficient]]] digest occurrences: 4 7) phrase [[[ But that method is very slow, as I need to loop through +the content many times]]] digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytim +es]]] digest occurrences: 4 8) phrase [[[ I'm wondering if there's a way to do it more efficient]] +] digest [[[imwonderingiftheresawaytodoitmoreefficient]]] digest occurrences: 4 9) phrase [[[ But that method is very slow, as I need to loop through +the content many times]]] digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytim +es]]] digest occurrences: 4 10) phrase [[[ I'm wondering if there's a way to do it more efficient] +]] digest [[[imwonderingiftheresawaytodoitmoreefficient]]] digest occurrences: 4 11) phrase [[[ But that method is very slow, as I need to loop through + the content many times]]] digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytim +es]]] digest occurrences: 4 12) phrase [[[ I'm wondering if there's a way to do it more efficient] +]] digest [[[imwonderingiftheresawaytodoitmoreefficient]]] digest occurrences: 4 13) phrase [[[ "of the" is a phrase that's apt to recur (and many time +s) in many documents]]] digest [[[oftheisaphrasethatsapttorecurandmanytimesinmanydocuments]]] digest occurrences: 1 14) phrase [[[ Do you care]]] digest [[[doyoucare]]] digest occurrences: 1 15) phrase [[[ Or do you really mean that the ONLY recurring phrase yo +u care about is "Leonardo da Vinci" or something similarly restricte +d]]] digest [[[ordoyoureallymeanthattheonlyrecurringphraseyoucareaboutisleo +nardodavinciorsomethingsimilarlyrestricted]]] digest occurrences: 1 16) phrase [[[ "Leonardo da Vinci" or something similarly restricted]] +] digest [[[leonardodavinciorsomethingsimilarlyrestricted]]] digest occurrences: 6 17) phrase [[[ "Leonardo da Vinci" or something similarly restricted]] +] digest [[[leonardodavinciorsomethingsimilarlyrestricted]]] digest occurrences: 6 18) phrase [[[ "Leonardo da Vinci" or something similarly restricted]] +] digest [[[leonardodavinciorsomethingsimilarlyrestricted]]] digest occurrences: 6 19) phrase [[[ "Leonardo da Vinci" or something similarly restricted]] +] digest [[[leonardodavinciorsomethingsimilarlyrestricted]]] digest occurrences: 6 20) phrase [[[ "Leonardo da Vinci" or something similarly restricted]] +] digest [[[leonardodavinciorsomethingsimilarlyrestricted]]] digest occurrences: 6 21) phrase [[[ "Leonardo da Vinci" or something similarly restricted]] +] digest [[[leonardodavinciorsomethingsimilarlyrestricted]]] digest occurrences: 6 22) phrase [[[ And while the "speed" will depend (in part) on your al +gorithm, the time the process will take to run to completion will li +kely be most influenced by the size of the text to search and the spe +cificity (or simplicity) of the search phrase (hint: read "regular e +xpression"), for any given language and box upon which to run it]]] digest [[[andwhilethespeedwilldependinpartonyouralgorithmthetimethepro +cesswilltaketoruntocompletionwilllikelybemostinfluencedbythesizeofthe +texttosearchandthespecificityorsimplicityofthesearchphrasehintreadreg +ularexpressionforanygivenlanguageandboxuponwhichtorunit]]] digest occurrences: 1
Re: Finding recurring phrases
by Herkum (Parson) on May 16, 2006 at 18:45 UTC

    Load your document as one big string and search that,

    open my $fh, '<', 'target.txt'; ## open your file my $file; { local $/ = undef; ## disable the line seperator variable. $file = <$fh>; ## load your file to a string. close $fh; ## close your file } my $phrase = qr{ ## begin phrase Leonardo \s+ ## needed to avoid weird whitespace issues da \s+ ## needed to avoid weird whitespace issues Vinci }gxms; ## match all occurences of our phrase # Returns an array with all the matches my @matches = $file =~ $phrase; # Returns a count of how many matches you had for your string print "Found " . @matches . " matches\n";
A reply falls below the community's threshold of quality. You may see it by logging in.