BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:
Given a string of arbitrary (long) length that is known to comprise of a number of repetitions of an unknown length substring, (thought the last repetition my be incomplete), how to find that repeat sequence?
Eg. Given 'abcdabcdabcdabcdab' find 'abcd'.
Complications:
Eg. In 'abcdabcdabceabcdabcdabceab'; 'abcd' is a false rep; the required rep is 'abcdabcdabce';
The number of characters 'ignored' at the end of the string should be less than the length of the rep. Is is possible to code that into a regex? I guess it could just be checked afterwards.
I'm assuming that a regex solution would be possible, but I cannot wrap my brain around it today?
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Finding repeat sequences.
by DamianConway (Beadle) on Jun 18, 2013 at 21:43 UTC | |
Damian | [reply] [d/l] |
by DamianConway (Beadle) on Jun 19, 2013 at 22:05 UTC | |
Here's a significantly optimized version of my previous solution, which draws heavily on anomalous's excellent work, but only bothers to capture the initial repeated component (in $1) as that's apparently all that's wanted.
It still passes BrowserUK's gauntlet, however in my own testing it's approximately 400 times faster than my previous attempt. I suspect that's still not sufficient to make it competitive with the non-regex solutions. (I know which one I'd rather maintain, though ;-) As you can see, the key to the improvement in regex performance was—as usual—to eliminate opportunities for backtracking or capturing. Damian | [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 20, 2013 at 01:08 UTC | |
I'm not sure why you think it would be any easier to maintain that the non-regex solutions? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
by DamianConway (Beadle) on Jun 20, 2013 at 08:02 UTC | |
by hdb (Monsignor) on Jun 20, 2013 at 09:00 UTC | |
| |
by BrowserUk (Patriarch) on Jun 20, 2013 at 15:11 UTC | |
| |
by BrowserUk (Patriarch) on Jun 18, 2013 at 22:10 UTC | |
That works perfectly and does exactly what I want. Thank you. I suspect it may run rather slowly as is -- /xms etc. -- but given the included test suite, tuning should be simple :) (BTW: Nice to see you: to see you ... ) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
Re: Finding repeat sequences.
by tobyink (Canon) on Jun 18, 2013 at 22:23 UTC | |
Not using regexps at all, but...
Note that when there are multiple possible matches, this returns the shortest, because it doesn't make sense to return the longest - the longest is uninteresting. For example, given the input abcabca, it could be that the answer is abc repeated two and a bit times, or abcabc repeated one and a bit times, or abcabca repeated exactly once. (Well, not really "repeated" but you know what I mean. The entire input string itself is always a valid and uninteresting answer.) Or, depending on how the problem is defined, the correct answer might be abcabcaxx repeated less than one time - i.e. the first repetition was truncated! So the only interesting answer to return is the shortest possible one.
package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 18, 2013 at 23:01 UTC | |
That's interesting...(not faint praise.). (I'm already thinking of optimisations; like build the longest repeat sequence and then use substr to get shorter versions.) For example, given the input abcabca, it could be ... So the only interesting answer to return is the shortest possible one. Hm. Damn you for making me think (again) at this time of night :) There will always be at least one complete substring. If there is more than 1 but less than 2, ie. 1 rep + 1 partial; (I believe) it will always be possible to determine the longest < length string match; because the residual always matches the length( residual ) first characters of the string. So, if the string is 'abcabca'; the rep could be 'abcabc' or 'abc'. But if the rep consists entirely of an exact integer number of reps of a subsubstring, then the substring is that subsubstring and the string consists of rep*n(n>1) + a partial. Thus, I believe that there is only ever one results. It will be interesting to pitch your solution against DamianConway's regex and see how they compare. I simply have no feel for it; but it's a job for tomorrow. Thank you. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
by hdb (Monsignor) on Jun 19, 2013 at 07:46 UTC | |
UPDATE: WARNING: the following code does not work in all circumstances. Sorry! Here is a variant of tobyink's solution that uses index to look ahead when the current candidate string repeats and then enlengthens (is that an English word?) it accordingly.
UPDATE: Eily's solution below Re^3: Finding repeat sequences. can be used to avoid the construction of the repeated string (as it is the same just with offset). Therefore, this works even better:
| [reply] [d/l] [select] |
|
Re: Finding repeat sequences.
by choroba (Cardinal) on Jun 18, 2013 at 19:10 UTC | |
returns
Can you give more input samples to exemplify the other constraints?
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 18, 2013 at 19:59 UTC | |
They are kind of hard to come up with, but okay. Given the rep 'aaaabaaaaba' and a string containing one whole and one partial rep 'aaaabaaaabaaaaabaaaab'
Which isn't correct because:
You can fix that by removing the redundant .* per LanX's version: m[(.+)\1] but then you get:
Which isn't right:
I realise that this is a 'cheat' as there in no complete repetition to find, but it is one possible scenario. Given the string will always consist of 1 or more repetitions of the substring, whatever partial substring (if any) is at the end of the string should match the same number of characters at the start of the string. That's the bit I'm having trouble wrapping my head around. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
Re: Finding repeat sequences.
by LanX (Saint) on Jun 18, 2013 at 19:09 UTC | |
since regex are greedy, this works for me
edit: I doubt that the answer is so simple, but this should give you a start to ask more precisely...
Cheers Rolf ( addicted to the Perl Programming Language) | [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 18, 2013 at 20:01 UTC | |
Please see Re^2: Finding repeat sequences.. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
Re: Finding repeat sequences. (only mostly regex)
by tye (Sage) on Jun 18, 2013 at 19:40 UTC | |
I assume that the pattern must repeat at least twice, otherwise, the full string is always the longest answer. A simple regex can get a good guess and tell you when that guess has failed in such a way that each subsequent guess will be more than twice as long as the previous guess so the regex doesn't have to be run very many times:
You likely can optimize this by copying less stuff, of course. (Update: Well, I didn't get very rigorous in proving to myself that $pattern.$repeat is always too short. But I believe that to be the case. One should validate or refute that assumption before deciding whether to use this.) - tye | [reply] [d/l] |
by BrowserUk (Patriarch) on Jun 18, 2013 at 20:04 UTC | |
I assume that the pattern must repeat at least twice, otherwise, the full string is always the longest answer. I wish that were the case. It mostly will be, but sometimes the string will consist of 1 complete and 1 partial rep. But the partial rep at the end *will* exactly match the same number of characters at the beginning of the string, so it will always be possible to determine the rep. But how to encode that in a regex or at least avoid a brute force chop and compare? With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
by tye (Sage) on Jun 18, 2013 at 20:13 UTC | |
Note that, based on that definition, if the first and last characters are the same, then the answer is "the string minus the last character". Which leads to:
Which leads to a full solution of:
which might be horribly inefficient (at least for some cases) or might not; I haven't considered it. - tye | [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 18, 2013 at 20:21 UTC | |
by tye (Sage) on Jun 18, 2013 at 21:17 UTC | |
by choroba (Cardinal) on Jun 18, 2013 at 20:10 UTC | |
? If it is complete, you get the whole one.
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
| [reply] [d/l] |
by BrowserUk (Patriarch) on Jun 18, 2013 at 20:18 UTC | |
by choroba (Cardinal) on Jun 18, 2013 at 20:23 UTC | |
| |
|
Re: Finding repeat sequences.
by rjt (Curate) on Jun 18, 2013 at 20:23 UTC | |
This smells an awful lot to me like the Longest Repeated Substring Problem, maybe with a bit of a twist. Have you looked at SuffixTree?
What is not clear to me from your description is whether you are looking for the longest substring with at least one repeat, or whether you are looking for the arbitrary length substring with the highest repeat count, or whether you are looking for the substring which, along with its (adjacent?) repeats comprises the longest length, or something else. Can you provide some more information and examples? A Super Search revealed:
| [reply] [d/l] |
by BrowserUk (Patriarch) on Jun 18, 2013 at 20:39 UTC | |
What is not clear to me from your description is whether you are looking for the longest substring with at least one repeat, or whether you are looking for the arbitrary length substring with the highest repeat count, or whether you are looking for the substring which, along with its (adjacent?) repeats comprises the longest length, or something else. Can you provide some more information and examples? I thought (believe) I have described the problem exactly. Constructing examples is hard -- I have a program running (for 4+ hours now) generating controlled random string and trying to find exceptional cases. I'll try the description (unsatisfactory) again. The complete string will consist of, and only of, one or more repetitions of a substring, But the last repetition may be truncated. In code:
That is, all these are valid strings and all have 'fred' as their substring:
With regard to suffix trees, I feel I would probably need a prefix tree (Trie) instead, but these string can be very long and every implementation of Trie I've seen would not handle them. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
by rjt (Curate) on Jun 18, 2013 at 22:31 UTC | |
I thought (believe) I have described the problem exactly. Oh, I do not question that. What is in question is my own ability to comprehend. :-) Thanks for clarifying. I have a better idea now.
Edit: Aha, it looks like Mr. Conway hit it on the head! | [reply] |
|
Re: Finding repeat sequences.
by AnomalousMonk (Archbishop) on Jun 19, 2013 at 00:24 UTC | |
... seems to do the trick for all the permutations I've seen so far. I have no idea of the performance hit of the (?(?{ CODE })yes-pattern) thingy. (Update: index $1, $2 works just as well as 0 != index $1, $2) Read more... (4 kB)
Update: Just noticed needless /g modifier in posted regex. Final regex should be | [reply] [d/l] [select] |
|
Re: Finding repeat sequences. (Results:Part 1)
by BrowserUk (Patriarch) on Jun 19, 2013 at 16:17 UTC | |
Below are the results from my first pass -- verifying basic functionality -- of (my adaptions of) the 8 solutions from tye, choroba, DamianConway, tobyink, AnomalousMonk, hdb, Eily, sundialsvc4 in this thread:
If authors want to correct (my adaptions of) their solutions that's great, (but please don't moan at me If I screwed the pouch adapting them to subroutines :). Here's the test harness: <Reveal this spoiler or all in this thread>
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 19, 2013 at 18:06 UTC | |
After exclusions on functionality, the three left standing -- DamianConway, tobyink & AnomalousMonk -- went forward to performance testing where the latter quickly fell by the wayside. Of the remaining two, tobyink's solution is hands down winner with a cumulative 66 seconds version DamianConway's 1670 seconds:
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
by hdb (Monsignor) on Jun 19, 2013 at 19:07 UTC | |
| [reply] [d/l] |
by BrowserUk (Patriarch) on Jun 19, 2013 at 19:46 UTC | |
by hdb (Monsignor) on Jun 19, 2013 at 20:03 UTC | |
by tye (Sage) on Jun 20, 2013 at 00:40 UTC | |
Looking for 'fredfre' in 'fredfrefredfr' tye found 'fredfrefred'; excluded from further consideration Looks like you changed your definition of your desired results, so it was silly to use a "solution" tailored to a different definition. This particular example matches my original speculation of what made sense, which means you might want to try my original solution. Though, I didn't wade through everything trying to find the various redefinitions of what was desired (besides the ones given in reply to my nodes) and then sort them chronologically in order to figure out what the new requirements actually are. - tye | [reply] |
by BrowserUk (Patriarch) on Jun 20, 2013 at 01:11 UTC | |
There is no new requirement. Re-read the OP. With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
Re: Finding repeat sequences.
by locked_user sundialsvc4 (Abbot) on Jun 19, 2013 at 04:06 UTC | |
Consider the following:
The “key” to this puzzle, as I read it, is the tail. This is the shortest string at the end of the string which matches the beginning of the string. The example code tries aggressively to find that number by adding power-of-two-smaller successive increments as many times as it can. (Note: might this introduce a flaw, vs. a simple 2-up count?) The second part of the algorithm now tries to find the longest string which occurs an integral number of times in the leading (repeated ...) portion. We know that the length of this string must be a mathematical factor, i.e. (length mod factor == 0). Only the first and second occurrence must be considered. If none can be found, the string consists of one non-repeated occurrence. | |
by locked_user sundialsvc4 (Abbot) on Jun 19, 2013 at 04:22 UTC | |
It might even be easier than this ... (pseudocode this time)
As before, start by looking for a tail. If you think you found one, look for repetitions in possibly-sized increments of the remaining string. This approach will consider all successively-larger candidates for the "tail," up to and including the largest tail, which is "half of it." For each tail-size, it looks for one repetition of all possible sizes which would fit evenly within this area, knowing that the left-hand portion is defined to consist only of repetitions. Continue for all tail-sizes, and for each of these, all rep-sizes, until the first success is found. It works only for data which does, in fact, match this assertion. | |
by Eily (Monsignor) on Jun 19, 2013 at 12:40 UTC | |
Oh! I didn't think of that one. But I'm not sure why you have to do all those extra verifications. Since the input data is a repeated pattern, the length of that pattern is how little you can shift your string to the left and have a perfect match.
abcdabcdabce | [reply] [d/l] |
by locked_user sundialsvc4 (Abbot) on Jun 19, 2013 at 13:05 UTC | |
I think that the pseudo-coded solution is the strongest one, if it is made to stop after finding the longest possible substring (then last), and if it is allowed to find all possible (reasonably-long) tails, once again starting with the longest one. (These would be what a human being would probably consider to be the best “right answers.”) With very-short tails, as written it might produce wrong answers if it considers only the first two occurrences (consider string 'aaba' if the tail were merely 'a' .. incorrect). But the essential idea, I think, is still valid. One reason why I wrote it this way was in an effort to avoid “churning” memory when dealing with exceptionally-long strings. You don’t need to consider any string that isn’t a tail, nor, within the head, any repeated-string candidate that won’t fit. That subdivides the problem into two searches, both of which have only a few possibilities each. It might not yet be bug-free, but it ought to be close . . . | |
|
Re: Finding repeat sequences.
by Eily (Monsignor) on Jun 19, 2013 at 08:18 UTC | |
The easiest way to do it with is regex is to use the reversed string, because if you want to check that the end is included in the repeated pattern, you have to be able to make a reference to it.
abcdabcdabce Edit: it probably is faster with the pattern /^(.+?)(.*?\1)\2*$/ since it uses a minimum length string as the trailing partial repetition, instead of a maximum one (ie : it does not read the whole input and bactracks). The first one should work in any case though, the second would work only if the base string is repeated at least once, even partially. | [reply] [d/l] [select] |
|
Re: Finding repeat sequences.
by LanX (Saint) on Jun 21, 2013 at 01:03 UTC | |
is $str = ($pattern x $n ) . substr($pattern,0,$k) with 0 <= $k < ($l = length($pattern)) and the task is to find maximum $pattern for a given $str to fit these constraints? If yes, some simple mathematics should already considerably minimize the set of possible combinations you need to investigate with regexes. Cheers Rolf ( addicted to the Perl Programming Language)
test
| [reply] [d/l] [select] |
by hdb (Monsignor) on Jun 21, 2013 at 10:54 UTC | |
It is to find the shortest pattern, otherwise $n==1 always. Correction: replaced $n=1 with $n==1 | [reply] |
by BrowserUk (Patriarch) on Jun 21, 2013 at 01:24 UTC | |
and the task is to find maximum $pattern to fit these constraints? Um. I cannot see any errors in that. So yes. If yes, some simple mathematics should already considerably minimize the set of possible combinations you need to investigate with regexes. Hm. A realistic, but relatively small, example from my test harness:
L=64000, N = 10,000, K=28,740. But those could equally well be: L=16,000, N = 40,001, K=12,740; or (thousands*) of other permutations. I don't think it helps. (*I'm being very, very conservative; my best guess is 100s, of millions.) With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
by LanX (Saint) on Jun 21, 2013 at 02:22 UTC | |
needs to be extended for longer possible tails. But taking the dimensions of your data I doubt that regexes are appropriate. You could test all $patterns which repeat at least once (or x times) and calculate $k = $m % $l with $m =length ($str), and check if $str starts and ends with the same substring $tail of length $k and then check if the pattern continues repeating. Or start eliminating all possible $tails and check if $l of a repeating pattern is a divisor of the $rest. Had no time to check all the other posted solutions and don't wanna reinvent the wheel, so I better stop here! =) HTH
Cheers Rolf ( addicted to the Perl Programming Language) | [reply] [d/l] [select] |
by BrowserUk (Patriarch) on Jun 21, 2013 at 02:48 UTC | |