Jester has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I have a small problem that I can't seem to solve. I need to find duplicate text within a paragraph. Let's say my text looks like this:

"The cat jumped over the dog. Smart cat! He jumped over the dog. The cat jumped over the dog."

I don't need to find all 3 repetitions of the word cat. However, I need to find the longest repeated sentence: 'The cat jumped over the dog'.

I was thinking of a lookahead, and tried something like

m/([\w\s\.]?){?=$1)/g;

which kinda fails miserably.

Any help much appreciated!

Replies are listed 'Best First'.
Re: Finding duplicate text in a paragraph
by davido (Cardinal) on Aug 20, 2012 at 20:57 UTC

    This is related to the Longest Repeated Substring Problem. According to the article I linked to, the problem can be solved in linear time after building a Suffix Tree. It turns out there's a module on CPAN that works with suffix trees: Tree::Suffix.

    Tree::Suffix is a big pain to install. You have to visit http://www.icir.org/christian/libstree/, download and unpack its tarball, and then sudo ./configure. After it's installed you have to force the installation of Tree::Suffix: cpanm -f Tree::Suffix. I haven't taken the time to uncover what's wrong with it that causes some failures. But it seems to work once installed.

    After that's all said and done, it becomes easy:

    use Tree::Suffix; my $string = 'The cat jumped over the dog. Smart cat! He jumped over the dog. + ' . 'The cat jumped over the dog.'; my $tree = Tree::Suffix->new($string); print "$_\n" for $tree->lrs;

    The output...

    The cat jumped over the dog.

    $tree->lrs is a convenient alias to $tree->longest_repeated_substrings; you can use either method name.

    If anyone knows of a better module for building suffix trees (one that installs cleanly), please follow-up with suggestions.

    An aside: If anyone knows how to get in touch with the author for SuffixTree please let me know. His listed email address is no longer valid. I've also sent a message to module-authors@perl.org trying to locate him so that I can work on applying the patches that have been sitting for a few years in the RT. In the meantime an unofficial release has been uploaded.


    Dave

      Interesting module and article. This is more what I was looking for! Thanks a lot, I'll look into it.
Re: Finding duplicate text in a paragraph
by Kenosis (Priest) on Aug 20, 2012 at 20:20 UTC

    I'm inclined to use Lingua::EN::Sentence to get the sentences, since it's a well-developed module for that purpose. Given this, perhaps the following will help:

    use Modern::Perl; use Lingua::EN::Sentence qw( get_sentences ); my ( $LRS, %seen ) = ''; my $sentences = 'Mr. Cat jumped over the dog. Smart Mr. Cat! He jumped + over the dog. Mr. Cat jumped over the dog.'; map { $seen{$_}++ and length $_ > length $LRS and $LRS = $_ } @{ get_sentences($sentences) }; say $LRS;

    Output:

    Mr. Cat jumped over the dog.

    Update: Modified some sentences to challenge Lingua::EN::Sentence.

      Interesting module, I'll look into in, thank you.
Re: Finding duplicate text in a paragraph
by BrowserUk (Patriarch) on Aug 20, 2012 at 20:59 UTC

    You could try this:

    #! perl -slw use strict; my $para = do{ local $/; <DATA> }; $para =~ tr[ \t\n][ ]s; ## normalise whitespace my $longest = ''; length( $1 ) > length( $longest ) and print '?', $longest = $1 while $para =~ m[(.+)(?=.*?\1)]ig; print $longest; __DATA__ Ah, I am kind of embarrassed not to have thought of this. I was so focused on look ahead, for some reason, but yeah, this works. The thing is, I need to match more than one sentence, potentially whole sections of texts can be duplicated and I need to find them. I will try your solution see how it goes. Thanks!I am kind of embarrassed not to have thought of this for some reason, but yeah, this works. The thing is, I need to match more than one sentence,

    Outputs:

    C:\test>junk24 ?Ah, ?I am kind of embarrassed not to have thought of this ?for some reason, but yeah, this works. The thing is, I need to match +more than one sentence, for some reason, but yeah, this works. The thing is, I need to match m +ore than one sentence,

    Remove the print '?', once your happy it is doing the right thing.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: Finding duplicate text in a paragraph
by hbm (Hermit) on Aug 20, 2012 at 19:52 UTC

    Something like this?

    use strict; use warnings; my $text = 'The cat jumped over the dog. Smart cat! He jumped over the + dog. The cat jumped over the dog. Smart cat!'; my %seen; my $longest = ''; while ($text =~ /\s*(.+?[!?.])/g) { $longest = $1 if $seen{$1}++ && length $1 > length $longest; } print $longest,$/;

    Prints:

    The cat jumped over the dog.

    But it's a whole other challenge if you mean "phrase" and not (terminated) sentence.

      Ah, I am kind of embarrassed not to have thought of this. I was so focused on look ahead, for some reason, but yeah, this works. The thing is, I need to match more than one sentence, potentially whole sections of texts can be duplicated and I need to find them. I will try your solution see how it goes. Thanks!
      Oh, in fact this does not quite work. In that case, I would need it to print "The cat jumped over the dog." and "Smart cat!" as strings that are repeated.