Re: Finding recurring phrases
by diotalevi (Canon) on May 16, 2006 at 18:55 UTC
|
Check this out. It finds the reoccuring phrases "of", "on", "Leonardo da Vinci", "da Vinci", and "Vinci".
$_ = <<"...";
Leonard of Quirm, a character in the Discworld series of novels,
is based largely on Leonardo da Vinci.
Leonardo da Vinci died at Clos Lucé, France, on 2nd May, 1519.
...
# Normalize the whitespace
s/\s+/ /g;
my $RX = qr/
# NODE EXPLANATION
# --------------------------------------------------------------------
+--
\b # the boundary between a word char (\w) and
# something that is not a word char
# --------------------------------------------------------------------
+--
( # group and capture to \1:
# --------------------------------------------------------------------
+--
\w+ # word characters (a-z, A-Z, 0-9, _) (1 o
+r
# more times (matching the most amount
# possible))
# --------------------------------------------------------------------
+--
(?: # group, but do not capture (0 or more
# times (matching the most amount
# possible)):
# --------------------------------------------------------------------
+--
\s+ # whitespace (\n, \r, \t, \f, and " ")
# (1 or more times (matching the most
# amount possible))
# --------------------------------------------------------------------
+--
\w+ # word characters (a-z, A-Z, 0-9, _) (1
# or more times (matching the most
# amount possible))
# --------------------------------------------------------------------
+--
)* # end of grouping
# --------------------------------------------------------------------
+--
) # end of \1
# --------------------------------------------------------------------
+--
\b # the boundary between a word char (\w) and
# something that is not a word char
# --------------------------------------------------------------------
+--
.+? # any character except \n (1 or more times
# (matching the least amount possible))
# --------------------------------------------------------------------
+--
\b # the boundary between a word char (\w) and
# something that is not a word char
# --------------------------------------------------------------------
+--
\1 # what was matched by capture \1
# --------------------------------------------------------------------
+--
\b # the boundary between a word char (\w) and
# something that is not a word char
/xms;
while ( /$RX/sg ) {
pos() = $-[0] + 1;
print "<$1>\n";
}
| [reply] [d/l] |
|
|
| [reply] |
|
|
| [reply] |
|
|
| [reply] |
Re: Finding recurring phrases
by salva (Canon) on May 16, 2006 at 19:18 UTC
|
a solution with O(NlogN) cost and with low memory consumption:
use strict;
use warnings;
my $data = do { local $/; lc <DATA> };
my @words = $data =~ /\w+/g;
{
no warnings;
sub cmpix {
for (my $off = 0;; $off++) {
my $cmp = $words[$a+$off] cmp $words[$b+$off];
return $cmp if $cmp;
}
}
}
my @ixs = sort cmpix 0..$#words;
# replace this code with duplication detector (it
# should be easy!):
for my $ix (@ixs) {
print substr(join(" ", @words[$ix..$#words]), 0, 50), "\n";
}
__DATA__
...
that outputs...
$ perl ./t.pl
1519
2nd may 1519
a character in the discworld series of novels is b
at clos luc france on 2nd may 1519
based largely on leonardo da vinci leonardo da vin
character in the discworld series of novels is bas
clos luc france on 2nd may 1519
da vinci died at clos luc france on 2nd may 1519
da vinci leonardo da vinci died at clos luc france
died at clos luc france on 2nd may 1519
discworld series of novels is based largely on leo
france on 2nd may 1519
in the discworld series of novels is based largely
is based largely on leonardo da vinci leonardo da
largely on leonardo da vinci leonardo da vinci die
leonard of quirm a character in the discworld seri
leonardo da vinci died at clos luc france on 2nd m
leonardo da vinci leonardo da vinci died at clos l
luc france on 2nd may 1519
may 1519
novels is based largely on leonardo da vinci leona
of novels is based largely on leonardo da vinci le
of quirm a character in the discworld series of no
on 2nd may 1519
on leonardo da vinci leonardo da vinci died at clo
quirm a character in the discworld series of novel
series of novels is based largely on leonardo da v
the discworld series of novels is based largely on
vinci died at clos luc france on 2nd may 1519
vinci leonardo da vinci died at clos luc france on
then finding duplicates is pretty obvious as they appear in consecutive entries in @ixs. | [reply] [d/l] [select] |
|
|
Brilliant ! Thanks salva. I'm going to use this solution. It's a clean and fast !
| [reply] |
Re: Finding recurring phrases
by GrandFather (Saint) on May 16, 2006 at 19:16 UTC
|
You may be interested in Fast common substring matching. It's intended for a very different application domain, but may suit the search you wish to perform ("find all recurring phrases").
DWIM is Perl's answer to Gödel
| [reply] |
Re: Finding recurring phrases
by TedPride (Priest) on May 16, 2006 at 23:01 UTC
|
Not so efficient, but output divided by words and counted. Sorting substrings alone is easy.
use strict;
use warnings;
my $words = 1; # Phrase must contain at least 1 word
my $size = 5; # Phrase must be at least 5 characters long
my $matches = 2; # Must be at least 2 copies of phrase
my (@words, @pos, @matches, $p, $c, $key);
$_ = join '', <DATA>;
$_ = lc($_);
@words = m/\w+(?:'\w+)?/g;
@pos = sort { mycmp($a, $b) } 0..$#words;
for $p ($words..$#pos) {
$c = mycount($p);
for (1..$c) {
$key = join ' ', @words[$pos[$p]..($pos[$p]+$_-1)];
next if length($key) < $size;
$matches[$_]{$key}++;
}
}
for (reverse $words..$#matches) {
print "$_ words:\n";
$c = $matches[$_];
for (sort { $c->{$b} <=> $c->{$a} } keys %$c) {
last if $c->{$_} < $matches - 1;
print " $_ : ",($c->{$_}+1),"\n";
}
}
sub mycount {
my $x = $pos[$_[0]];
my $y = $pos[$_[0]-1];
my $c = 0;
$c++ while $x <= $#words && $y <= $#words && $words[$x++] eq $word
+s[$y++];
return $c;
}
sub mycmp {
my ($x, $y) = @_;
while ($x <= $#words && $y <= $#words) {
return $c if $c = $words[$x++] cmp $words[$y++];
}
return $x <=> $y;
}
__DATA__
Section. 1.
All legislative Powers herein granted shall be vested in a Congress of
+ the United States, which shall consist of a Senate and House of Repr
+esentatives.
Section. 2.
The House of Representatives shall be composed of Members chosen every
+ second Year by the People of the several States, and the Electors in
+ each State shall have the Qualifications requisite for Electors of t
+he most numerous Branch of the State Legislature.
No Person shall be a Representative who shall not have attained to the
+ Age of twenty five Years, and been seven Years a Citizen of the Unit
+ed States, and who shall not, when elected, be an Inhabitant of that
+State in which he shall be chosen.
I suppose I could write this to use a linear counting method of some sort, but that's for another day. | [reply] [d/l] |
Re: Finding recurring phrases
by ww (Archbishop) on May 16, 2006 at 18:45 UTC
|
Standard Reply:
What have you tried? where's your code?
Non-standard caveat:
How do you intend to define "recurring phrases?"
"of the" is a phrase that's apt to recur (and many times) in many documents. Do you care? Or do you really mean that the ONLY recurring phrase you care about is "Leonardo da Vinci" or something similarly restricted?
And while the "speed" will depend (in part) on your algorithm, the time the process will take to run to completion will likely be most influenced by the size of the text to search and the specificity (or simplicity) of the search phrase (hint: read "regular expression"), for any given language and box upon which to run it.
So, please, rethink your question, a bit, CORRECTION, duh! and update it (anonymonk can't update) add info as new comment to provide additional detail.
pertinent update from anonymonk! --\v
| [reply] |
|
|
sub add_content {
my $self = shift;
my $content = shift;
$words = [ split(/\s+/, $content) ];
for ($i=0; $i < scalar(@$words) ; $i++) {
my $first_word = lc($words->[$i]);
my $second_word = lc($words->[$i+1]);
# 2 word phrases
if ($self->is_relevant_word($first_word , $second_word
+) && $first_word ne "$second_word") {
my $phrase = $first_word . " " . $second_word;
$self->{_related}{$phrase}++;
$self->_rate_phrase($phrase);
}
}
}
I'm just counting the occurences of phrases like this: $hash{$phrase}++ , and afterwards look for hash elements with values > 1. | [reply] [d/l] |
Re: Finding recurring phrases
by planetscape (Chancellor) on May 17, 2006 at 10:32 UTC
|
| [reply] |
Re: Finding recurring phrases
by leocharre (Priest) on May 16, 2006 at 19:17 UTC
|
#!/usr/bin/perl -w
use strict;
my $text = <<MYTEXT_MYTEXT;
This is my first sentence, Ok?
This is another one. Leonardo da Vinci died at Clos Lucé, France, on 2
+nd May, 1519.
The only solution I can think of is to loop through the text, word by
+word,
and search the remaining text for multiple occurences of that word. If
+ found,
check if the successive words are the same.
But that method is very slow, as I need to loop through the content ma
+ny times.
I'm wondering if there's a way to do it more efficient.
But that method is very slow, as I need to loop through the content ma
+ny times.
I'm wondering if there's a way to do it more efficient.
But that method is very slow, as I need to loop through the content ma
+ny times.
I'm wondering if there's a way to do it more efficient.
But that method is very slow, as I need to loop through the content ma
+ny times.
I'm wondering if there's a way to do it more efficient.
"of the" is a phrase that's apt to recur (and many times) in many docu
+ments.
Do you care? Or do you really mean that the ONLY recurring phrase you
+care about is
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
"Leonardo da Vinci" or something similarly restricted?
And while the "speed" will depend (in part) on your algorithm, the tim
+e the process
will take to run to completion will likely be most influenced by the s
+ize of the text
to search and the specificity (or simplicity) of the search phrase (hi
+nt: read
"regular expression"), for any given language and box upon which to ru
+n it.
MYTEXT_MYTEXT
$text=~s/\n|\t/ /sg;
my @phrases = split(/\.|\?|\!/,$text);
# let's allow room for similarity
# by making a digest of each phrase
my %phrases=();
my %digests=();
for ( @phrases ){
my $phrase=$_;
$phrase=~/\w/ or next;
my $digest=lc($phrase);
$digest=~s/\W|\s|\d//g;
$phrases{$phrase}=$digest;
$digests{$digest}++;
}
my $count =0;
for (@phrases){
my $phrase=$_;
$phrase=~/\w/ or next;
print STDERR "$count) phrase [[[$phrase]]]\ndigest [[[$phrases{$ph
+rase}]]]\n"
."digest occurrences: ".$digests{$phrases{$phrase}}."\n\n";
$count++;
}
Produces as output:
[leo@mescaline ~]$ perl recurring.pl
0) phrase [[[This is my first sentence, Ok]]]
digest [[[thisismyfirstsentenceok]]]
digest occurrences: 1
1) phrase [[[ This is another one]]]
digest [[[thisisanotherone]]]
digest occurrences: 1
2) phrase [[[ Leonardo da Vinci died at Clos Lucé, France, on 2nd May,
+ 1519]]]
digest [[[leonardodavincidiedatcloslucfranceonndmay]]]
digest occurrences: 1
3) phrase [[[ The only solution I can think of is to loop through the
+text, word by word, and search the remaining text for multiple occur
+ences of that word]]]
digest [[[theonlysolutionicanthinkofistoloopthroughthetextwordbywordan
+dsearchtheremainingtextformultipleoccurencesofthatword]]]
digest occurrences: 1
4) phrase [[[ If found, check if the successive words are the same]]]
digest [[[iffoundcheckifthesuccessivewordsarethesame]]]
digest occurrences: 1
5) phrase [[[ But that method is very slow, as I need to loop through
+ the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytim
+es]]]
digest occurrences: 4
6) phrase [[[ I'm wondering if there's a way to do it more efficient]]
+]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4
7) phrase [[[ But that method is very slow, as I need to loop through
+the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytim
+es]]]
digest occurrences: 4
8) phrase [[[ I'm wondering if there's a way to do it more efficient]]
+]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4
9) phrase [[[ But that method is very slow, as I need to loop through
+the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytim
+es]]]
digest occurrences: 4
10) phrase [[[ I'm wondering if there's a way to do it more efficient]
+]]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4
11) phrase [[[ But that method is very slow, as I need to loop through
+ the content many times]]]
digest [[[butthatmethodisveryslowasineedtoloopthroughthecontentmanytim
+es]]]
digest occurrences: 4
12) phrase [[[ I'm wondering if there's a way to do it more efficient]
+]]
digest [[[imwonderingiftheresawaytodoitmoreefficient]]]
digest occurrences: 4
13) phrase [[[ "of the" is a phrase that's apt to recur (and many time
+s) in many documents]]]
digest [[[oftheisaphrasethatsapttorecurandmanytimesinmanydocuments]]]
digest occurrences: 1
14) phrase [[[ Do you care]]]
digest [[[doyoucare]]]
digest occurrences: 1
15) phrase [[[ Or do you really mean that the ONLY recurring phrase yo
+u care about is "Leonardo da Vinci" or something similarly restricte
+d]]]
digest [[[ordoyoureallymeanthattheonlyrecurringphraseyoucareaboutisleo
+nardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 1
16) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]
+]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6
17) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]
+]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6
18) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]
+]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6
19) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]
+]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6
20) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]
+]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6
21) phrase [[[ "Leonardo da Vinci" or something similarly restricted]]
+]
digest [[[leonardodavinciorsomethingsimilarlyrestricted]]]
digest occurrences: 6
22) phrase [[[ And while the "speed" will depend (in part) on your al
+gorithm, the time the process will take to run to completion will li
+kely be most influenced by the size of the text to search and the spe
+cificity (or simplicity) of the search phrase (hint: read "regular e
+xpression"), for any given language and box upon which to run it]]]
digest [[[andwhilethespeedwilldependinpartonyouralgorithmthetimethepro
+cesswilltaketoruntocompletionwilllikelybemostinfluencedbythesizeofthe
+texttosearchandthespecificityorsimplicityofthesearchphrasehintreadreg
+ularexpressionforanygivenlanguageandboxuponwhichtorunit]]]
digest occurrences: 1
| [reply] [d/l] [select] |
Re: Finding recurring phrases
by Herkum (Parson) on May 16, 2006 at 18:45 UTC
|
Load your document as one big string and search that,
open my $fh, '<', 'target.txt'; ## open your file
my $file;
{
local $/ = undef; ## disable the line seperator variable.
$file = <$fh>; ## load your file to a string.
close $fh; ## close your file
}
my $phrase = qr{ ## begin phrase
Leonardo \s+ ## needed to avoid weird whitespace issues
da \s+ ## needed to avoid weird whitespace issues
Vinci }gxms; ## match all occurences of our phrase
# Returns an array with all the matches
my @matches = $file =~ $phrase;
# Returns a count of how many matches you had for your string
print "Found " . @matches . " matches\n";
| [reply] [d/l] |
| A reply falls below the community's threshold of quality. You may see it by logging in. |