Finding duplicate text in a paragraph

Jester has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Finding duplicate text in a paragraph by davido (Cardinal) on Aug 20, 2012 at 20:57 UTC
This is related to the Longest Repeated Substring Problem. According to the article I linked to, the problem can be solved in linear time after building a Suffix Tree. It turns out there's a module on CPAN that works with suffix trees: Tree::Suffix. Tree::Suffix is a big pain to install. You have to visit http://www.icir.org/christian/libstree/, download and unpack its tarball, and then `sudo ./configure`. After it's installed you have to force the installation of Tree::Suffix: `cpanm -f Tree::Suffix`. I haven't taken the time to uncover what's wrong with it that causes some failures. But it seems to work once installed. After that's all said and done, it becomes easy: `use Tree::Suffix; my $string = 'The cat jumped over the dog. Smart cat! He jumped over the dog. + ' . 'The cat jumped over the dog.'; my $tree = Tree::Suffix->new($string); print "$_\n" for $tree->lrs;` [download] The output... `The cat jumped over the dog.` [download] `$tree->lrs` is a convenient alias to `$tree->longest_repeated_substrings`; you can use either method name. If anyone knows of a better module for building suffix trees (one that installs cleanly), please follow-up with suggestions. An aside: If anyone knows how to get in touch with the author for SuffixTree please let me know. His listed email address is no longer valid. I've also sent a message to module-authors@perl.org trying to locate him so that I can work on applying the patches that have been sitting for a few years in the RT. In the meantime an unofficial release has been uploaded. Dave	[reply] [d/l] [select]
Re^2: Finding duplicate text in a paragraph by Jester (Novice) on Aug 20, 2012 at 21:00 UTC
Interesting module and article. This is more what I was looking for! Thanks a lot, I'll look into it.	[reply]
Re: Finding duplicate text in a paragraph by Kenosis (Priest) on Aug 20, 2012 at 20:20 UTC
I'm inclined to use `Lingua::EN::Sentence` to get the sentences, since it's a well-developed module for that purpose. Given this, perhaps the following will help: `use Modern::Perl; use Lingua::EN::Sentence qw( get_sentences ); my ( $LRS, %seen ) = ''; my $sentences = 'Mr. Cat jumped over the dog. Smart Mr. Cat! He jumped + over the dog. Mr. Cat jumped over the dog.'; map { $seen{$_}++ and length $_ > length $LRS and $LRS = $_ } @{ get_sentences($sentences) }; say $LRS;` [download] Output: `Mr. Cat jumped over the dog.` [download] Update: Modified some sentences to challenge `Lingua::EN::Sentence`.	[reply] [d/l] [select]
Re^2: Finding duplicate text in a paragraph by Jester (Novice) on Aug 20, 2012 at 20:43 UTC
Interesting module, I'll look into in, thank you.	[reply]
Re^3: Finding duplicate text in a paragraph by Kenosis (Priest) on Aug 20, 2012 at 20:46 UTC
You're most welcome, Jester!	[reply]
Re: Finding duplicate text in a paragraph by BrowserUk (Patriarch) on Aug 20, 2012 at 20:59 UTC
You could try this: #! perl -slw use strict; my $para = do{ local $/; <DATA> }; $para =~ tr[ \t\n][ ]s; ## normalise whitespace my $longest = ''; length( $1 ) > length( $longest ) and print '?', $longest = $1 while $para =~ m[(.+)(?=.*?\1)]ig; print $longest; __DATA__ Ah, I am kind of embarrassed not to have thought of this. I was so focused on look ahead, for some reason, but yeah, this works. The thing is, I need to match more than one sentence, potentially whole sections of texts can be duplicated and I need to find them. I will try your solution see how it goes. Thanks!I am kind of embarrassed not to have thought of this for some reason, but yeah, this works. The thing is, I need to match more than one sentence, [download] Outputs: `C:\test>junk24 ?Ah, ?I am kind of embarrassed not to have thought of this ?for some reason, but yeah, this works. The thing is, I need to match +more than one sentence, for some reason, but yeah, this works. The thing is, I need to match m +ore than one sentence,` [download] Remove the `print '?',` once your happy it is doing the right thing. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday' Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. The start of some sanity?	[reply] [d/l] [select]
Re: Finding duplicate text in a paragraph by hbm (Hermit) on Aug 20, 2012 at 19:52 UTC
Something like this? `use strict; use warnings; my $text = 'The cat jumped over the dog. Smart cat! He jumped over the + dog. The cat jumped over the dog. Smart cat!'; my %seen; my $longest = ''; while ($text =~ /\s*(.+?[!?.])/g) { $longest = $1 if $seen{$1}++ && length $1 > length $longest; } print $longest,$/;` [download] Prints: `The cat jumped over the dog.` [download] But it's a whole other challenge if you mean "phrase" and not (terminated) sentence.	[reply] [d/l] [select]
Re^2: Finding duplicate text in a paragraph by Jester (Novice) on Aug 20, 2012 at 20:40 UTC
Ah, I am kind of embarrassed not to have thought of this. I was so focused on look ahead, for some reason, but yeah, this works. The thing is, I need to match more than one sentence, potentially whole sections of texts can be duplicated and I need to find them. I will try your solution see how it goes. Thanks!	[reply]
Re^2: Finding duplicate text in a paragraph by Jester (Novice) on Aug 20, 2012 at 20:45 UTC
Oh, in fact this does not quite work. In that case, I would need it to print "The cat jumped over the dog." and "Smart cat!" as strings that are repeated.	[reply]