If I understand correctly, you're interested in long substrings repeated numerous times. Here's a roughed-out approach based roughly on my Re^3: Fast common substring matching, which is, in turn, based roughly on the bzip algorithm.

For purposes of sanity, take a stab at some length of substring that you simply don't expect to be repeated. Maybe somewhere in the 100 to 1000 range. Make a file of all substrings of that length. (Unix) sort the lines of the file.

Now you just need to look at each line's neighbors to see how long their common prefix is. The longest repeated substring is guaranteed to be a common prefix of two neighboring lines.

To find how many times a substring is repeated, you'll have to keep checking subsequent lines until they don't have enough prefix in common to be interesting.

I haven't written any code for this, and probably won't. But I hope the description is helpful.

Update: I wrote some example code for finding the longest repeated substring. More: I enhanced it to generate a long string (60000 digits) and to limit the substring length, and to time itself. It runs in under 2 seconds on my PC.

use strict; use warnings; use Time::HiRes; my $startTime = [Time::HiRes::gettimeofday ()]; my $str = ''; $str .= int rand 10 for 1..60000; my @subs = sort map substr($str, $_, 400), 0..length($str)-1; # Find the longest common prefix between any two neighbors my $length = 0; my $value = ''; for my $i (0..$#subs-1) { my ($common) = map length, ($subs[$i] ^ $subs[$i+1]) =~ /^(\0*)/; if ($common > $length) { ($length, $value) = ($common, substr($subs[$i], 0, $common)); } } print "$value is the longest repeated substr\n"; print "Completed in " . Time::HiRes::tv_interval ($startTime) . "\n";

Caution: Contents may have been coded under pressure.

In reply to Re: Longest repeated string... by Roy Johnson
in thread Longest repeated string... by Yzzyx

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.