in reply to find all repeating sequences
Your OP is a bit confusing (please see the comments of others) and I've made a few guesses and assumptions about what you want. In particular, I assume you're only dealing with ATCG bases; if not, you'll have to replace all (I think) of the . (dot) operators with [ATCG] character classes. In addition, you'll need Perl version 5.10 or later for the \K operator. Try this:
c:\@Work\Perl\monks>perl -wMstrict -le
"use 5.010;
 ;;
 SEQUENCE:
 for my $s (qw(
     A  AA  AAAAAAAA  ATCG  AATTCCGG  AAAAAAAATTTTTTTT
     AAA    AAAA    AAAAA    AAAAAA    AAAAAAA
     TAAA   TAAAA   TAAAAA   TAAAAAA   TAAAAAAA
     AAAT   AAAAT   AAAAAT   AAAAAAT   AAAAAAAT
     TAAAT  TAAAAT  TAAAAAT  TAAAAAAT  TAAAAAAAT
     ATTAAATTTTCCCCCGGGGGGAAAAAAATTTTTTTT
     ATTAAACTTTTGCCCCCAGGGGGGCAAAAAAATTTTTTTT
     ATTAAACCTTTTGGCCCCCAAGGGGGGCCAAAAAAATTTTTTTT
     ), @ARGV) {
   my @runs;
   push @runs, [ $2, $-[2] ] while $s =~
       m{ (?: \G | (.) (?! \1)) \K ((.) \3{2,6} (?! \3)) }xmsg;
   ;;
   print qq{$s};
   print qq{no run(s) \n} and next SEQUENCE unless @runs;
   for my $ar_run (@runs) {
     my ($run, $offset) = @$ar_run;
     printf qq{%*s at offset %d \n}, $offset+length($run), qq{$run}, $offset;
     }
   print '';
   }
 "
A
no run(s)

AA
no run(s)

AAAAAAAA
no run(s)

ATCG
no run(s)

AATTCCGG
no run(s)

AAAAAAAATTTTTTTT
no run(s)

AAA
AAA at offset 0

AAAA
AAAA at offset 0

AAAAA
AAAAA at offset 0

AAAAAA
AAAAAA at offset 0

AAAAAAA
AAAAAAA at offset 0

TAAA
 AAA at offset 1

TAAAA
 AAAA at offset 1

TAAAAA
 AAAAA at offset 1

TAAAAAA
 AAAAAA at offset 1

TAAAAAAA
 AAAAAAA at offset 1

AAAT
AAA at offset 0

AAAAT
AAAA at offset 0

AAAAAT
AAAAA at offset 0

AAAAAAT
AAAAAA at offset 0

AAAAAAAT
AAAAAAA at offset 0

TAAAT
 AAA at offset 1

TAAAAT
 AAAA at offset 1

TAAAAAT
 AAAAA at offset 1

TAAAAAAT
 AAAAAA at offset 1

TAAAAAAAT
 AAAAAAA at offset 1

ATTAAATTTTCCCCCGGGGGGAAAAAAATTTTTTTT
   AAA at offset 3
      TTTT at offset 6
          CCCCC at offset 10
               GGGGGG at offset 15
                     AAAAAAA at offset 21

ATTAAACTTTTGCCCCCAGGGGGGCAAAAAAATTTTTTTT
   AAA at offset 3
       TTTT at offset 7
            CCCCC at offset 12
                  GGGGGG at offset 18
                         AAAAAAA at offset 25

ATTAAACCTTTTGGCCCCCAAGGGGGGCCAAAAAAATTTTTTTT
   AAA at offset 3
        TTTT at offset 8
              CCCCC at offset 14
                     GGGGGG at offset 21
                             AAAAAAA at offset 29
In the above, $s is your sequence; the statement that does the work is
my @runs;
push @runs, [ $2, $-[2] ] while $s =~ m{ (?: \G | (.) (?! \1)) \K ((.) \3{2,6} (?! \3)) }xmsg;
Update: The somewhat roundabout regex fragment
(?: \G | (.) (?! \1)) \K
in the foregoing acts as a positive look-behind: it ensures that a sub-sequence of identical ATCG bases is indeed preceded by a transition from one base to another, or else begins at the start of the main sequence/string or at the end of the previous match (the \G alternative). This complexity is necessary because the more straightforward negative look-behind of
(?<! \2\2)
in, say,
((.) (?<! \2\2) \2{2,6} (?! \2))
will not compile: the \2 backreference is treated as variable-width even though its referent is the clearly fixed-width (.) capture. I think the regex compiler simply cannot detect the inherent single-character nature of this capture and assumes variable width as the worst case. (The \K operator of Perl versions 5.10+ is effectively a variable-width positive look-behind.)
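To make the effect of \K concrete, here is a minimal sketch. The sub name first_run is my own invention for illustration, not anything from the OP. It applies the regex once (no /g) and returns the overall match $& together with capture 2 and its offset. Whatever is matched to the left of \K is dropped from $&, which is why the transition character never shows up in the reported run.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical helper (name is mine): return the first run of 3 to 7
# identical characters as (whole match, capture 2, offset of capture 2),
# or an empty list if there is no such run.
sub first_run {
    my ($s) = @_;
    return unless $s =~ m{ (?: \G | (.) (?! \1)) \K ((.) \3{2,6} (?! \3)) }xms;
    # $& excludes everything matched before \K, so it equals capture 2.
    return ($&, $2, $-[2]);
}

for my $s (qw(TAAAT AAAAAAAA)) {
    my @found = first_run($s);
    if (@found) {
        printf "%s: match '%s', capture '%s', offset %d\n", $s, @found;
    }
    else {
        print "$s: no run\n";
    }
}
```

For TAAAT the leading T is consumed by the (.) (?! \1) alternative but discarded by \K, so the match and the capture are both AAA at offset 1. For the eight-A string nothing is found, because no stretch of 3 to 7 A's is bounded by a different base or by the ends of the string.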
However, it occurred to me to try another tack. All look-arounds are zero-width (i.e., fixed-width) assertions. Is a look-ahead (which can be variable-width) embedded in a look-behind along with other fixed-width operators accepted as collectively fixed-width? It turns out it is. The look-behind
(?<! (?= \2\2) ..)
in the regex
((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2))
compiles and works just fine for all my test cases and is IMHO simpler and more maintainable than the original circumlocution. It's also pre-5.10 compatible (as far as I can test).
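If you want to see how your own Perl treats the two look-behinds, something like the following sketch may help. The helper try_compile is hypothetical (my name, not a core function); it just wraps qr// in an eval and reports either 'ok' or the compiler's complaint. I make no promise about the exact message for the plain look-behind, since that may differ between Perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper (name is mine): try to compile a pattern under /xms
# and return 'ok', or 'failed: ' followed by the compiler's error message.
sub try_compile {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/xms };
    return defined $re ? 'ok' : "failed: $@";
}

# Single-quoted so that \2 reaches the regex compiler untouched.
my %candidate = (
    'plain look-behind' => '((.) (?<! \2\2) \2{2,6} (?! \2))',
    'nested look-ahead' => '((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2))',
);

for my $name (sort keys %candidate) {
    my $result = try_compile($candidate{$name});
    $result =~ s/\s+\z//;    # trim the trailing newline of any error text
    print "$name: $result\n";
}
```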
So then the critical section of code becomes
my @runs;
push @runs, [ $1, $-[1] ] while $sequence =~ m{ ((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2)) }xmsg;
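For anyone who would rather have this as a sub than as a one-liner, here is a rough sketch. Both sub names (find_runs and find_runs_by_scan) are my own. The second sub is not part of the solution: it is just an independent cross-check that grabs every maximal run with a plain ((.) \2*) and filters on length in Perl code, on the assumption that "run" means 3 to 7 identical characters as in the regex above.

```perl
use strict;
use warnings;

# Hypothetical wrapper (name is mine) around the look-behind regex:
# returns a list of [ run, offset ] pairs for runs of 3 to 7 characters.
sub find_runs {
    my ($sequence) = @_;
    my @runs;
    push @runs, [ $1, $-[1] ]
        while $sequence =~ m{ ((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2)) }xmsg;
    return @runs;
}

# Cross-check without look-arounds: take each maximal run of identical
# characters and keep it only if its length is between 3 and 7.
sub find_runs_by_scan {
    my ($sequence) = @_;
    my @runs;
    while ($sequence =~ m{ ((.) \2*) }xmsg) {
        my $len = length $1;
        push @runs, [ $1, $-[1] ] if $len >= 3 && $len <= 7;
    }
    return @runs;
}

# Flatten a list of [ run, offset ] pairs into one comparable string.
sub as_text { return join ' ', map { "$_->[0]\@$_->[1]" } @_ }

for my $s (qw(AAAAAAAA TAAAT ATTAAATTTTCCCCCGGGGGGAAAAAAATTTTTTTT)) {
    my $by_regex = as_text(find_runs($s));
    my $by_scan  = as_text(find_runs_by_scan($s));
    printf "%s: %s (%s)\n", $s,
        ($by_regex eq $by_scan ? 'agree' : 'DIFFER'),
        (length $by_regex ? $by_regex : 'no runs');
}
```

The scan version is easier to read, but it reports a run only after consuming all of it; the regex version does the length test inside the pattern, which is closer to what the OP asked for.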
Give a man a fish: <%-{-{-{-<