Re: Finding repeat sequences.

Not using regexps at all, but...

#!/usr/bin/env perl

use 5.012;
use Test::More;

sub find_substring
{
    my $input  = shift;
    my $length = length $input;
    
    for my $i (1 .. $length)
    {
        my $possible = substr($input, 0, $i);
        my $repeated = $possible x (1 + int($length / $i));
        return $possible if $input eq substr($repeated, 0, $length);
    }
    
    return "";
}

my %eg = (
    abcdabcdabcdabcdab          => "abcd",
    abcdabcdabceabcdabcdabceab  => "abcdabcdabce",
    aaaabaaaabaaaaabaaaab       => "aaaabaaaaba",
);

for my $input (sort keys %eg)
{
    my $expected = $eg{$input};
    my $got      = find_substring($input);
    
    is($got, $expected, "result is '$expected' given input '$input'");
}

done_testing;

Note that when there are multiple possible matches, this returns the shortest, because it doesn't make sense to return the longest - the longest is uninteresting.

For example, given the input abcabca, it could be that the answer is abc repeated two and a bit times, or abcabc repeated one and a bit times, or abcabca repeated exactly once. (Well, not really "repeated" but you know what I mean. The entire input string itself is always a valid and uninteresting answer.) Or, depending on how the problem is defined, the correct answer might be abcabcaxx repeated less than one time - i.e. the first repetition was truncated!

So the only interesting answer to return is the shortest possible one.
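
To make the abcabca example concrete, here is a small sketch that lists every prefix which reproduces the input when repeated and truncated. It uses the same repeat-and-truncate test as find_substring above; the name all_candidates is made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the code above): return every prefix of
# $input that, repeated and truncated to the input length, equals $input.
sub all_candidates
{
    my $input  = shift;
    my $length = length $input;
    my @found;

    for my $i (1 .. $length)
    {
        my $possible = substr($input, 0, $i);
        my $repeated = $possible x (1 + int($length / $i));
        push @found, $possible if $input eq substr($repeated, 0, $length);
    }

    return @found;
}

print join(", ", all_candidates("abcabca")), "\n";   # abc, abcabc, abcabca
```

The first element of that list is what find_substring returns; the last is always the whole input.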

package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name

Re^2: Finding repeat sequences. by BrowserUk (Patriarch) on Jun 18, 2013 at 23:01 UTC
That's interesting... (not faint praise). (I'm already thinking of optimisations, like building the longest repeat sequence and then using substr to get shorter versions.)

"For example, given the input abcabca, it could be ... So the only interesting answer to return is the shortest possible one."

Hm. Damn you for making me think (again) at this time of night :)

There will always be at least one complete substring. If there is more than 1 but less than 2, i.e. 1 rep + 1 partial, (I believe) it will always be possible to determine the longest < length string match, because the residual always matches the first length( residual ) characters of the string.

So, if the string is 'abcabca', the rep could be 'abcabc' or 'abc'. But if the rep consists entirely of an exact integer number of reps of a subsubstring, then the substring is that subsubstring and the string consists of rep*n (n>1) + a partial.

Thus, I believe that there is only ever one result.

It will be interesting to pitch your solution against DamianConway's regex and see how they compare. I simply have no feel for it; but it's a job for tomorrow. Thank you.

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
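
The remark about the residual can be tried out directly. A prefix of length $p is a valid repeat exactly when whatever is left after dropping the first $p characters matches the start of the string. This is only a sketch, and is_repeat_length is a name invented here, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the first $p characters of $input, repeated
# and truncated, would reproduce $input. Rather than building the repeated
# string, compare the residual (everything after the first $p characters)
# with the prefix of the same length.
sub is_repeat_length
{
    my ($input, $p) = @_;
    my $length = length $input;
    return 0 if $p < 1 || $p > $length;
    return substr($input, $p) eq substr($input, 0, $length - $p) ? 1 : 0;
}

my $input = "abcabca";
my @ok    = grep { is_repeat_length($input, $_) } 1 .. length $input;
print "@ok\n";   # 3 6 7
```

For 'abcabca' the residuals are 'abca' (after 'abc') and 'a' (after 'abcabc'), and both match the start of the string.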
Re^3: Finding repeat sequences. by hdb (Monsignor) on Jun 19, 2013 at 07:46 UTC
UPDATE: WARNING: the following code does not work in all circumstances. Sorry!

Here is a variant of tobyink's solution that uses `index` to look ahead to where the current candidate string repeats and then enlengthens (is that an English word?) the candidate accordingly.

```perl
sub find_substring {
    my $input  = shift;
    my $length = length $input;
    my $i = 0;
    my $possible;
    while( 1 ) {
        $possible = substr $input, 0, $i+1;    # increase length by 1
        $i = index $input, $possible, $i+1;    # find next occurrence of candidate
        return $input if $i < 0;               # if not found return full string => no repetition
        $possible = substr $input, 0, $i;      # this is the minimum length candidate
        return $possible
            if $input eq substr($possible x (1 + int($length / $i)), 0, $length);   # success
    }
}
```

UPDATE: Eily's solution below (Re^3: Finding repeat sequences.) can be used to avoid the construction of the repeated string (as it is the same just with offset). Therefore, this works even better:

```perl
sub find_substring {
    my $input  = shift;
    my $length = length $input;
    my $i = 0;
    while( 1 ) {
        $i = index( $input, substr( $input, 0, $i+1 ), $i+1 );
        return $input if $i < 0;
        return substr( $input, 0, $i ) if substr( $input, $i ) eq substr( $input, 0, $length - $i );
    }
}
```
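
One input where the warning seems to bite, assuming a truncated final repetition still counts (as in tobyink's version), is 'abcabdab'. After 'abc' is rejected, `index` searches for the four-character prefix 'abca', but the trailing partial repeat is only 'ab', so nothing is found and the whole string comes back instead of 'abcabd'. The sketch below is one possible repair, not hdb's or Eily's code: it looks ahead for the first character only, so a short tail cannot be skipped. The name find_substring_fixed is made up.

```perl
use strict;
use warnings;

# Sketch of one possible repair (an assumption, not code from the thread):
# use index() to jump to the next place the first character occurs, then
# confirm the candidate length with the tail/prefix comparison.
sub find_substring_fixed
{
    my $input  = shift;
    my $length = length $input;
    return "" if $length == 0;

    my $first = substr($input, 0, 1);
    my $i     = 0;

    while (1)
    {
        $i = index($input, $first, $i + 1);   # next place a repeat could start
        return $input if $i < 0;              # first character never recurs
        return substr($input, 0, $i)
            if substr($input, $i) eq substr($input, 0, $length - $i);
    }
}

print find_substring_fixed("abcabdab"), "\n";   # abcabd
```

This skips fewer positions than searching for a growing prefix, which is the price of not missing a short trailing repeat.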