find all repeating sequences

charm has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: find all repeating sequences by Marshall (Canon) on Dec 20, 2016 at 20:41 UTC
I believe you want a) the match and also b) the position of the match? Modifying kschwab's regex:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = 'AAAA TTT GGGG CCCC AAAAAA CCCC ABAB CCC TTT TTTT GGGEGEE';

# $1 is a whole "word" made of one character repeated 3 to 7 times;
# $-[1] is the offset at which $1 starts.
while ($string =~ m/\b((\w)\2{2,6})\b/g) {
    print "$-[1] $1 \n";
}
__END__
#prints
0 AAAA
5 TTT
9 GGGG
14 CCCC
19 AAAAAA
26 CCCC
36 CCC
40 TTT
44 TTTT
```

Perlvar mentions `@-` and `@+`, although the explanation there is not completely clear. Perl 5.6.0 introduced the `@-` and `@+` dynamic arrays, which supply the offsets of the last successful match: `@-` holds the start offsets and `@+` holds the end offsets. `$-[0]` and `$+[0]` correspond to the entire pattern, while `$-[N]` and `$+[N]` correspond to the $N ($1, $2, etc.) submatches.
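As a minimal sketch of how `@-` and `@+` line up with the capture groups in the regex above: the helper name `describe_match` and the sample string are assumptions made for illustration, not part of the original post. Note that each `$+[N]` is the offset just past the last matched character.

```perl
use strict;
use warnings;

# Hypothetical helper: report where the whole match, $1 and $2 landed.
sub describe_match {
    my ($string) = @_;
    return unless $string =~ /((\w)\2{2,6})/;
    my %info = (
        whole  => [ $-[0], $+[0] ],   # offsets of the entire match
        run    => [ $-[1], $+[1] ],   # offsets of $1, the full run
        letter => [ $-[2], $+[2] ],   # offsets of $2, the single captured character
        text   => substr($string, $-[0], $+[0] - $-[0]),   # the matched text itself
    );
    return \%info;
}

my $info = describe_match('xxAAAAyy');
printf "whole match: %d..%d (%s)\n", @{ $info->{whole} }, $info->{text};
printf "group 2:     %d..%d\n", @{ $info->{letter} };
```

Here the run `AAAA` starts at offset 2 and ends before offset 6, while group 2 covers only the first `A` (2 to 3).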
Re^2: find all repeating sequences by talexb (Chancellor) on Dec 20, 2016 at 21:01 UTC
Wow. I never knew about `@-` and `@+` .. that's really quite brilliant. Thanks!!!!

Alex / talexb / Toronto
Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
Re^2: find all repeating sequences by charm (Initiate) on Dec 21, 2016 at 09:04 UTC
Thank you! This is exactly what I needed.
Re: find all repeating sequences by kennethk (Abbot) on Dec 20, 2016 at 20:45 UTC
Please read Markup in the Monastery. Note how your square brackets have been turned into a link. I see the bare `</code>` in your post, but that should have been written (I think) as

```
<code>
if($seqR =~m/\G([A-Z]{3,7})+?$_/g) {
    #print "All repeated sequences $_\n";
};
</code>
```

This also sounds a bit like homework, and a bit like you are getting ahead of yourself. As discussed in How do I post a question effectively?, think about what you are trying to do and where you are hitting issues. Are you sure your randomly generated strings contain what you think they do? Where did you get the code you've posted?

You have a `\G` in there, which does not really make sense in the context where you're using it. See Assertions in perlre for documentation. You are also interpolating `$_` into your regex, which makes no sense to me, and going for a non-greedy match, which is strange as well (see Metacharacters). What was the source material for this construction? Have you read perlretut for a basic description of what regular expressions do?

You probably want to use backreferences; see Capture groups in perlre. Perhaps you confused `\G` and `\g`? You also probably want to use pos, $& and/or $-[0] in some combination to figure out positions, however you'd like to define that. So you end up with something like:

```perl
while ($seqR =~ m/([ATGC])\g1{2,6}/g) {
    my $position = pos($seqR);   # offset just past the end of the match
    print "$& found at $position\n";
}
```

#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
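To make the difference between `pos` and `$-[0]` concrete, here is a small sketch. The helper name `run_positions` and the sample sequence are illustrative assumptions; the pattern is the same shape as the one suggested above.

```perl
use strict;
use warnings;

# Hypothetical helper: collect [run, start, end] for each run of
# 3 to 7 identical bases found by a /g match.
sub run_positions {
    my ($seq) = @_;
    my @found;
    while ($seq =~ /([ATGC])\g1{2,6}/g) {
        # $-[0] is where the match starts; pos($seq) is where it ended,
        # which is also where the next /g search resumes.
        push @found, [ $&, $-[0], pos($seq) ];
    }
    return @found;
}

printf "%s starts at %d, ends before %d\n", @$_
    for run_positions('GATTTACCCCG');
```

Which of the two offsets counts as "the position" is a matter of definition. Note also that this pattern does not reject a run longer than seven characters; it reports such a run in pieces.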
Re: find all repeating sequences by kschwab (Vicar) on Dec 20, 2016 at 19:54 UTC
If I'm understanding you correctly:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @seqs = qw(AAAA ABAB CCC DDDDDDD EEEEFFF);
foreach my $seq (@seqs) {
    # One captured character followed by 2 to 6 copies of it, and nothing else.
    if ($seq =~ m/^(\w)\1{2,6}$/) {
        print "$seq: REPEATED\n";
    } else {
        print "$seq: NOT REPEATED\n";
    }
}
```

Outputs:

```
AAAA: REPEATED
ABAB: NOT REPEATED
CCC: REPEATED
DDDDDDD: REPEATED
EEEEFFF: NOT REPEATED
```

But that only matches strings made up entirely of one repeated character, like AAAA, and not strings like AAAAABBB. Not sure exactly what behavior you're looking for.

Edit: The magic here is backreferences and capture groups. See this bit in the perlre docs.
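The replies in this thread spell the backreference in different ways (`\1`, `\2`, `\g1`). As a hedged sketch, assuming Perl 5.10 or later for the `\g` and `\k` forms, the following checks that four spellings of the same anchored test agree. The `@spellings` table and the `verdicts` helper are illustrative names, not from the thread.

```perl
use strict;
use warnings;

# Each pattern captures one character and then requires 2 to 6 more copies of it.
my @spellings = (
    [ '\1'     => qr/^(\w)\1{2,6}$/        ],   # classic numbered backreference
    [ '\g1'    => qr/^(\w)\g1{2,6}$/       ],   # same group, \g syntax
    [ '\g{-1}' => qr/^(\w)\g{-1}{2,6}$/    ],   # relative: the previous group
    [ '\k<c>'  => qr/^(?<c>\w)\k<c>{2,6}$/ ],   # named capture group
);

# Hypothetical helper: one 0/1 verdict per spelling, in the order listed above.
sub verdicts {
    my ($seq) = @_;
    return map { $seq =~ $_->[1] ? 1 : 0 } @spellings;
}

for my $seq (qw(AAAA ABAB CCC EEEEFFF)) {
    print "$seq: ", join(' ', verdicts($seq)), "\n";
}
```

The relative and named forms tend to be easier to maintain when extra capture groups are later wrapped around the pattern, since the group numbers then shift.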
Re: find all repeating sequences by AnomalousMonk (Archbishop) on Dec 21, 2016 at 02:13 UTC
Your OP is a bit confusing (please see the comments of others) and I've made a few guesses and assumptions about what you want. In particular, I assume you're only dealing with `ATCG` bases; if not, you'll have to replace all (I think) of the `.` (dot) operators with `[ATCG]` character classes. In addition, you'll need Perl version 5.10 or later for the `\K` operator. The critical section of code is

```perl
my @runs;
push @runs, [ $2, $-[2] ]
    while $s =~ m{ (?: \G | (.) (?! \1)) \K ((.) \3{2,6} (?! \3)) }xmsg;
```

where `$s` is your sequence.

Update: The somewhat roundabout regex `(?: \G | (.) (?! \1)) \K` in the foregoing is a positive look-behind that ensures that a sub-sequence of identical `ATCG` bases is indeed preceded by a transition from one base to another (or is at the start of the main sequence/string). This complexity is necessary because the more straightforward negative look-behind of `(?<! \2\2)` in, say, `((.) (?<! \2\2) \2{2,6} (?! \2))` will not compile: the `\2` backreference is seen as being variable-width even though its referent is the clearly fixed-width `(.)` capture. I think the regex compiler simply cannot detect the inherent single-character nature of this capture and assumes variable-width as the worst-case scenario. (The `\K` operator of Perl versions 5.10+ is effectively a variable-width positive look-behind.)

However, it occurred to me to try another tack. All look-arounds are zero-width (i.e., fixed-width) assertions. Is a look-ahead (which can be variable-width) embedded in a look-behind along with other fixed-width operators accepted as collectively fixed-width? It turns out it is. The look-behind `(?<! (?= \2\2) ..)` in the regex `((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2))` compiles and works just fine for all my test cases and is IMHO simpler and more maintainable than the original circumlocution. It's also pre-5.10 compatible (as far as I can test).

So then the critical section of code becomes

```perl
my @runs;
push @runs, [ $1, $-[1] ]
    while $sequence =~ m{ ((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2)) }xmsg;
```

Give a man a fish: `<%-{-{-{-<`
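For cross-checking results, here is a sketch of a different and deliberately simpler technique than the look-around regexes above: match every maximal run of identical bases, then filter by length in plain Perl. The helper name `maximal_runs`, its `$min`/`$max` parameters and the sample sequence are assumptions made for illustration. Like the code above, it returns `[run, offset]` pairs.

```perl
use strict;
use warnings;

# Hypothetical helper: return [run, offset] pairs for maximal runs of
# identical ATCG bases whose length is between $min and $max inclusive.
sub maximal_runs {
    my ($seq, $min, $max) = @_;
    my @runs;
    # \2* is greedy, so $1 always extends to the end of the run;
    # a run that is too long is skipped whole rather than split.
    while ($seq =~ /(([ATCG])\2*)/g) {
        my $len = length $1;
        push @runs, [ $1, $-[1] ] if $len >= $min && $len <= $max;
    }
    return @runs;
}

print "$_->[0] at $_->[1]\n" for maximal_runs('AATTTCAAAAAAAAGGGG', 3, 7);
```

In this sample the run of eight `A`s is rejected as too long, so only `TTT` (offset 2) and `GGGG` (offset 14) are reported. This version trades the single-regex approach for a length check outside the pattern and avoids look-behind entirely.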