in reply to Finding recurring phrases

Standard Reply:

What have you tried? Where's your code?

Non-standard caveat:

How do you intend to define "recurring phrases?"

"of the" is a phrase that's apt to recur (and many times) in many documents. Do you care? Or do you really mean that the ONLY recurring phrase you care about is "Leonardo da Vinci" or something similarly restricted?

And while the "speed" will depend (in part) on your algorithm, the time the process takes to run to completion will likely be influenced most by the size of the text to search and by the specificity (or simplicity) of the search phrase (hint: read up on "regular expressions"), for any given language and box upon which to run it.
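For the restricted case (one known phrase such as "Leonardo da Vinci"), a regular expression is probably all that's needed. What follows is only a minimal sketch under that assumption: count_phrase() is a made-up helper name, the sample text is invented, and the \b anchors assume the phrase starts and ends with a word character.

```perl
use strict;
use warnings;

# Count case-insensitive occurrences of one fixed phrase in a text.
# count_phrase() is a hypothetical helper, not code from the original question.
sub count_phrase {
    my ($text, $phrase) = @_;

    # Escape each word, then allow any run of whitespace between the words.
    my $pattern = join '\s+', map { quotemeta($_) } split ' ', $phrase;

    # The "= () =" idiom forces list context, so $count is the number of matches.
    my $count = () = $text =~ /\b$pattern\b/gi;
    return $count;
}

# Invented sample text, for illustration only.
my $text = "Leonardo da Vinci painted it. Some say leonardo  da vinci also built it.";
print count_phrase($text, "Leonardo da Vinci"), "\n";
```

This scans the text once per phrase, so the cost grows with the size of the text, which is the point made above. It does nothing for the open-ended "find whatever recurs" case.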

So, please, rethink your question a bit and (CORRECTION, duh! anonymonk can't update it) add info as a new comment to provide additional detail.

pertinent update from anonymonk! --\v

Re^2: Finding recurring phrases
by Anonymous Monk on May 16, 2006 at 19:35 UTC
    "of the" needs to be found, but it will be filtered out afterwards by a rule that prevents phrases from ending with a stopword.

    What I'm currently doing is this (for 2 word phrases):

    sub add_content {
        my $self    = shift;
        my $content = shift;
        my $words   = [ split(/\s+/, $content) ];

        # Stop one short of the end so $words->[$i+1] always exists.
        for (my $i = 0; $i < $#$words; $i++) {
            my $first_word  = lc($words->[$i]);
            my $second_word = lc($words->[$i+1]);

            # 2 word phrases
            if ($self->is_relevant_word($first_word, $second_word)
                && $first_word ne $second_word) {
                my $phrase = $first_word . " " . $second_word;
                $self->{_related}{$phrase}++;
                $self->_rate_phrase($phrase);
            }
        }
    }

    I'm just counting the occurrences of phrases like this: $hash{$phrase}++ , and afterwards I look for hash elements with values > 1.
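The same counting idea can be sketched for phrases of any length, with the "values > 1" check and the stopword rule applied afterwards. This is only an illustration under stated assumptions: recurring_phrases() and the tiny %stopword list are hypothetical names, not part of the module above, and punctuation is left attached to words just as split(/\s+/) leaves it.

```perl
use strict;
use warnings;

# Hypothetical stopword list; a real one would be much longer.
my %stopword = map { $_ => 1 } qw(of the a an and in to);

# Count every n-word phrase in $content, then keep only those that occur
# more than once and do not end with a stopword.
# recurring_phrases() is an illustrative name, not a method from the post.
sub recurring_phrases {
    my ($content, $n) = @_;

    # split ' ' skips leading whitespace, so no empty first word appears.
    my @words = split ' ', lc $content;

    my %count;
    # Stop at the last index where a full n-word window still fits.
    for my $i (0 .. $#words - $n + 1) {
        my $phrase = join ' ', @words[$i .. $i + $n - 1];
        $count{$phrase}++;
    }

    my %recurring;
    for my $phrase (keys %count) {
        next if $count{$phrase} < 2;          # must recur
        my ($last) = $phrase =~ /(\S+)$/;     # final word of the phrase
        next if $stopword{$last};             # may not end with a stopword
        $recurring{$phrase} = $count{$phrase};
    }
    return \%recurring;
}

# Invented sample sentence, for illustration only.
my $found = recurring_phrases("The cat sat on the mat and the cat sat on the rug", 2);
print "$_: $found->{$_}\n" for sort keys %$found;
```

With this sample, "on the" occurs twice but is dropped because it ends with a stopword, while "the cat" is kept. Memory use grows with the number of distinct phrases, so large texts or long phrases may need a different approach.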