Your OP is a bit confusing (please see the comments of others), so I've made a few guesses and assumptions about what you want. In particular, I assume you're only dealing with ATCG bases; if not, you'll have to replace all (I think) of the . (dot) operators with [ATCG] character classes. In addition, you'll need Perl version 5.10 or later for the \K operator. Try this:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use 5.010;
     ;;
     SEQUENCE:
     for my $s (qw(
         A  AA  AAAAAAAA  ATCG  AATTCCGG  AAAAAAAATTTTTTTT
         AAA   AAAA   AAAAA   AAAAAA   AAAAAAA
         TAAA  TAAAA  TAAAAA  TAAAAAA  TAAAAAAA
         AAAT  AAAAT  AAAAAT  AAAAAAT  AAAAAAAT
         TAAAT TAAAAT TAAAAAT TAAAAAAT TAAAAAAAT
         ATTAAATTTTCCCCCGGGGGGAAAAAAATTTTTTTT
         ATTAAACTTTTGCCCCCAGGGGGGCAAAAAAATTTTTTTT
         ATTAAACCTTTTGGCCCCCAAGGGGGGCCAAAAAAATTTTTTTT
         ), @ARGV) {
       my @runs;
       push @runs, [ $2, $-[2] ] while $s =~
           m{ (?: \G | (.) (?! \1)) \K ((.) \3{2,6} (?! \3)) }xmsg;
       ;;
       print qq{$s};
       print qq{no run(s) \n} and next SEQUENCE unless @runs;
       for my $ar_run (@runs) {
         my ($run, $offset) = @$ar_run;
         printf qq{%*s at offset %d \n}, $offset+length($run), qq{$run}, $offset;
         }
       print '';
       }
    "
    A
    no run(s)

    AA
    no run(s)

    AAAAAAAA
    no run(s)

    ATCG
    no run(s)

    AATTCCGG
    no run(s)

    AAAAAAAATTTTTTTT
    no run(s)

    AAA
    AAA at offset 0

    AAAA
    AAAA at offset 0

    AAAAA
    AAAAA at offset 0

    AAAAAA
    AAAAAA at offset 0

    AAAAAAA
    AAAAAAA at offset 0

    TAAA
     AAA at offset 1

    TAAAA
     AAAA at offset 1

    TAAAAA
     AAAAA at offset 1

    TAAAAAA
     AAAAAA at offset 1

    TAAAAAAA
     AAAAAAA at offset 1

    AAAT
    AAA at offset 0

    AAAAT
    AAAA at offset 0

    AAAAAT
    AAAAA at offset 0

    AAAAAAT
    AAAAAA at offset 0

    AAAAAAAT
    AAAAAAA at offset 0

    TAAAT
     AAA at offset 1

    TAAAAT
     AAAA at offset 1

    TAAAAAT
     AAAAA at offset 1

    TAAAAAAT
     AAAAAA at offset 1

    TAAAAAAAT
     AAAAAAA at offset 1

    ATTAAATTTTCCCCCGGGGGGAAAAAAATTTTTTTT
       AAA at offset 3
          TTTT at offset 6
              CCCCC at offset 10
                   GGGGGG at offset 15
                         AAAAAAA at offset 21

    ATTAAACTTTTGCCCCCAGGGGGGCAAAAAAATTTTTTTT
       AAA at offset 3
           TTTT at offset 7
                CCCCC at offset 12
                      GGGGGG at offset 18
                             AAAAAAA at offset 25

    ATTAAACCTTTTGGCCCCCAAGGGGGGCCAAAAAAATTTTTTTT
       AAA at offset 3
            TTTT at offset 8
                  CCCCC at offset 14
                         GGGGGG at offset 21
                                 AAAAAAA at offset 29

Of course, the critical section of code is
    my @runs;
    push @runs, [ $2, $-[2] ]
        while $s =~ m{ (?: \G | (.) (?! \1)) \K ((.) \3{2,6} (?! \3)) }xmsg;
where  $s is your sequence.
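
If you'd rather have this in a script than in a one-liner, here is a minimal sketch of the same regex wrapped in a sub. The name find_runs is my own invention for illustration; the regex itself is unchanged, and it still assumes Perl 5.10+ for \K.

```perl
use strict;
use warnings;
use 5.010;    # \K needs Perl 5.10 or later

# Hypothetical helper (the name is illustrative only): returns a list of
# [ run, offset ] pairs, one for each run of 3 to 7 identical characters.
sub find_runs {
    my ($s) = @_;
    my @runs;
    push @runs, [ $2, $-[2] ]
        while $s =~ m{
            (?: \G | (.) (?! \1) )   # string start or end of previous match, or a base followed by a different base
            \K                       # leave that prefix out of the reported match
            ( (.) \3{2,6} (?! \3) )  # group 2 is the run: a base plus 2 to 6 repeats, then no further repeat
        }xmsg;
    return @runs;
}

for my $ar_run (find_runs('ATTAAATTTTCCCCC')) {
    my ($run, $offset) = @$ar_run;
    print "$run at offset $offset\n";
}
```

The offset comes from the @- array, which holds the start position of each capture group after a successful match, so $-[2] is where the run itself begins rather than where the whole match begins.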

Update: The somewhat roundabout regex
    (?: \G | (.) (?! \1)) \K
in the foregoing is in effect a positive look-behind that ensures that a sub-sequence of identical ATCG bases is indeed preceded by a transition from one base to another (or is at the start of the main sequence/string). This complexity is necessary because the more straightforward negative look-behind of
     (?<! \2\2)
in, say,
    ((.) (?<! \2\2) \2{2,6} (?! \2))
will not compile: the \2 backreference is seen as being variable-width even though its referent is the clearly fixed-width (.) capture. I think the regex compiler simply cannot detect the inherent single-character nature of this capture and assumes variable-width as the worst-case scenario. (The \K operator of Perl versions 5.10+ is effectively a variable-width positive look-behind.)
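
If you want to check this on your own Perl, here is a small sketch that compiles the pattern at run time and reports what happens. The helper name compile_error is hypothetical, and I'm deliberately not quoting an error message here because the exact wording depends on your Perl version.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative only): compile a pattern
# under /x at run time. Returns an empty string if it compiled, otherwise
# the error text that the regex compiler raised.
sub compile_error {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/x };
    return defined $re ? '' : $@;
}

# Single quotes keep the backslashes for the regex compiler to see.
my $pattern = '((.) (?<! \2\2) \2{2,6} (?! \2))';
my $error   = compile_error($pattern);
print $error eq '' ? "compiled\n" : "did not compile: $error";
```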

However, it occurred to me to try another tack. All look-arounds are zero-width (i.e., fixed-width) assertions. Is a look-ahead (whose contents can be variable-width) embedded in a look-behind along with other fixed-width operators accepted as collectively fixed-width? It turns out it is. The look-behind
    (?<! (?= \2\2) ..)
in the regex
    ((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2))
compiles and works just fine for all my test cases and is IMHO simpler and more maintainable than the original circumlocution. It's also pre-5.10 compatible (as far as I can test).

So then the critical section of code becomes

    my @runs;
    push @runs, [ $1, $-[1] ]
        while $sequence =~ m{ ((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2)) }xmsg;

where $sequence is your sequence.
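
Here is the same idea as a minimal script sketch, in case that's easier to drop into your own code. The sub name find_runs_lookbehind is made up for illustration; the regex is the one above, just spread over several lines with comments.

```perl
use strict;
use warnings;

# Hypothetical helper (the name is illustrative only): returns a list of
# [ run, offset ] pairs, one for each run of 3 to 7 identical characters.
sub find_runs_lookbehind {
    my ($sequence) = @_;
    my @runs;
    push @runs, [ $1, $-[1] ]
        while $sequence =~ m{
            (                        # group 1 is the whole run
              (.)                    # group 2 is the first base of the run
              (?<! (?= \2\2) .. )    # step back two characters: they must not be two copies of that base
              \2{2,6}                # two to six more copies of the base
              (?! \2)                # and no further copy after that
            )
        }xmsg;
    return @runs;
}

for my $ar_run (find_runs_lookbehind('TAAAT')) {
    my ($run, $offset) = @$ar_run;
    print "$run at offset $offset\n";    # prints "AAA at offset 1"
}
```

The look-behind steps back over the base just captured plus the character before it, so it only rejects a position when the previous character is the same base, i.e. when we are in the middle of a longer run.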


Give a man a fish:  <%-{-{-{-<


In reply to Re: find all repeating sequences by AnomalousMonk
in thread find all repeating sequences by charm
