in reply to Re: Finding recurring phrases
in thread Finding recurring phrases

"of the" need to be found, but will be filtered out afterwards by a rule that prevents phrases to end with a stopword.

What I'm currently doing is this (for 2 word phrases):

sub add_content {
    my $self    = shift;
    my $content = shift;

    my $words = [ split(/\s+/, $content) ];

    # Stop one short of the last word so that $i + 1 is always a valid index.
    for (my $i = 0; $i < $#$words; $i++) {
        my $first_word  = lc($words->[$i]);
        my $second_word = lc($words->[$i + 1]);

        # 2 word phrases
        if ($self->is_relevant_word($first_word, $second_word)
            && $first_word ne $second_word) {
            my $phrase = $first_word . " " . $second_word;
            $self->{_related}{$phrase}++;
            $self->_rate_phrase($phrase);
        }
    }
}

I'm just counting the occurrences of each phrase like this: $hash{$phrase}++, and afterwards I look for hash elements with values > 1.
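
To make the counting and filtering steps concrete, here is a minimal standalone sketch, not the module code itself. The names recurring_phrases and %stopword, the stopword list, and the sample sentence are all made up for illustration; the real relevance check lives in is_relevant_word, which is not shown in this post.

```perl
use strict;
use warnings;

# Hypothetical stopword list, for illustration only.
my %stopword = map { $_ => 1 } qw(a an and of the to in);

# Count every pair of adjacent words, then keep only the pairs that
# occur more than once and do not end with a stopword.
sub recurring_phrases {
    my ($content) = @_;
    my @words = map { lc($_) } split ' ', $content;

    # First pass: count all 2 word phrases, including ones like "of the".
    my %count;
    for my $i (0 .. $#words - 1) {
        $count{"$words[$i] $words[$i + 1]"}++;
    }

    # Second pass: apply the "values > 1" test and the stopword rule.
    my %recurring;
    for my $phrase (keys %count) {
        next unless $count{$phrase} > 1;
        my $last_word = (split ' ', $phrase)[-1];
        next if $stopword{$last_word};
        $recurring{$phrase} = $count{$phrase};
    }
    return \%recurring;
}

my $found = recurring_phrases('the end of the road is the end of the line');
print "$_: $found->{$_}\n" for sort keys %$found;
```

With this sample sentence, "of the" and "end of" are both counted twice in the first pass but dropped in the second because they end with a stopword, so only "the end" is reported.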