Your OP is a bit confusing (please see the comments of others), so I've made a few guesses and assumptions about what you want. In particular, I assume you're only dealing with ATCG bases; if not, you'll have to replace all (I think) of the . (dot) metacharacters with [ATCG] character classes. In addition, you'll need Perl version 5.10 or later for the \K operator. Try this:

c:\@Work\Perl\monks>perl -wMstrict -le
"use 5.010;
 ;;
 SEQUENCE:
 for my $s (qw(
   A AA AAAAAAAA ATCG AATTCCGG AAAAAAAATTTTTTTT
   AAA AAAA AAAAA AAAAAA AAAAAAA
   TAAA TAAAA TAAAAA TAAAAAA TAAAAAAA
   AAAT AAAAT AAAAAT AAAAAAT AAAAAAAT
   TAAAT TAAAAT TAAAAAT TAAAAAAT TAAAAAAAT
   ATTAAATTTTCCCCCGGGGGGAAAAAAATTTTTTTT
   ATTAAACTTTTGCCCCCAGGGGGGCAAAAAAATTTTTTTT
   ATTAAACCTTTTGGCCCCCAAGGGGGGCCAAAAAAATTTTTTTT
   ), @ARGV) {
   my @runs;
   push @runs, [ $2, $-[2] ] while $s =~ m{
     (?: \G | (.) (?! \1)) \K ((.) \3{2,6} (?! \3))
     }xmsg;
   ;;
   print qq{$s};
   print qq{no run(s) \n} and next SEQUENCE unless @runs;
   for my $ar_run (@runs) {
     my ($run, $offset) = @$ar_run;
     printf qq{%*s at offset %d \n},
       $offset+length($run), qq{$run}, $offset;
     }
   print '';
   }
"
A
no run(s)

AA
no run(s)

AAAAAAAA
no run(s)

ATCG
no run(s)

AATTCCGG
no run(s)

AAAAAAAATTTTTTTT
no run(s)

AAA
AAA at offset 0

AAAA
AAAA at offset 0

AAAAA
AAAAA at offset 0

AAAAAA
AAAAAA at offset 0

AAAAAAA
AAAAAAA at offset 0

TAAA
 AAA at offset 1

TAAAA
 AAAA at offset 1

TAAAAA
 AAAAA at offset 1

TAAAAAA
 AAAAAA at offset 1

TAAAAAAA
 AAAAAAA at offset 1

AAAT
AAA at offset 0

AAAAT
AAAA at offset 0

AAAAAT
AAAAA at offset 0

AAAAAAT
AAAAAA at offset 0

AAAAAAAT
AAAAAAA at offset 0

TAAAT
 AAA at offset 1

TAAAAT
 AAAA at offset 1

TAAAAAT
 AAAAA at offset 1

TAAAAAAT
 AAAAAA at offset 1

TAAAAAAAT
 AAAAAAA at offset 1

ATTAAATTTTCCCCCGGGGGGAAAAAAATTTTTTTT
   AAA at offset 3
      TTTT at offset 6
          CCCCC at offset 10
               GGGGGG at offset 15
                     AAAAAAA at offset 21

ATTAAACTTTTGCCCCCAGGGGGGCAAAAAAATTTTTTTT
   AAA at offset 3
       TTTT at offset 7
            CCCCC at offset 12
                  GGGGGG at offset 18
                         AAAAAAA at offset 25

ATTAAACCTTTTGGCCCCCAAGGGGGGCCAAAAAAATTTTTTTT
   AAA at offset 3
        TTTT at offset 8
              CCCCC at offset 14
                     GGGGGG at offset 21
                             AAAAAAA at offset 29

Of course, the critical section of code is

my @runs;
push @runs, [ $2, $-[2] ] while $s =~ m{ 
  (?: \G | (.) (?! \1)) \K ((.) \3{2,6} (?! \3))
  }xmsg;

where $s is your sequence.
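
For reuse outside a one-liner, the same loop can be wrapped in a subroutine. This is only a minimal sketch: the name find_runs_k is my own invention for illustration, and it assumes the same 3-to-7 run length that the {2,6} quantifier implies.

```perl
use strict;
use warnings;
use 5.010;    # \K requires Perl 5.10 or later

# Hypothetical helper (not part of the original one-liner): returns a list
# of [run, offset] pairs for every run of 3 to 7 identical characters.
sub find_runs_k {
    my ($seq) = @_;
    my @runs;
    push @runs, [ $2, $-[2] ] while $seq =~ m{
        (?: \G | (.) (?! \1) ) \K    # where the last match ended, or just after a change of base
        ( (.) \3{2,6} (?! \3) )      # 3 to 7 identical characters, not followed by one more
    }xmsg;
    return @runs;
}

for my $run (find_runs_k('ATTAAATTTTCCCCCGGGGGGAAAAAAATTTTTTTT')) {
    printf "%s at offset %d\n", @$run;
}
```

Run on that sequence, it should list the same five runs as the output above; the trailing run of eight Ts is skipped because it exceeds seven.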

Update: The somewhat roundabout regex fragment
(?: \G | (.) (?! \1)) \K
in the foregoing acts as a positive look-behind: it ensures that a run of identical bases is either preceded by a transition from one base to another or begins where \G matches, that is, at the start of the string or at the end of the previous match. This complexity is necessary because the more straightforward negative look-behind of
(?<! \2\2)
in, say,
((.) (?<! \2\2) \2{2,6} (?! \2))
will not compile: the \2 backreference is treated as variable-width even though its referent is the clearly fixed-width (.) capture. I think the regex compiler simply cannot detect that this capture always holds a single character, so it assumes variable width as the worst case. (The \K operator of Perl 5.10+ is effectively a variable-width positive look-behind.)

However, it occurred to me to try another tack. All look-arounds are zero-width (and therefore fixed-width) assertions. Is a look-ahead (whose contents can be variable-width) embedded in a look-behind along with other fixed-width operators accepted as collectively fixed-width? It turns out it is. The look-behind
(?<! (?= \2\2) ..)
in the regex
((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2))
compiles and works just fine for all my test cases and is IMHO simpler and more maintainable than the original circumlocution. It's also pre-5.10 compatible (as far as I can test).
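
Both compilation claims can be checked on whatever Perl you have at hand. The sketch below is an illustration, not part of the original test run; the hash keys and variable names are made up. The patterns are kept in strings so they are compiled at run time, where eval can trap a failure. The exact error text, if any, depends on the Perl version.

```perl
use strict;
use warnings;

# Pattern sources as single-quoted strings: \2 stays a literal backslash-2
# and the regex is not compiled until qr// is evaluated below.
my %pattern = (
    'plain look-behind'   => '((.) (?<! \2\2) \2{2,6} (?! \2))',
    'wrapped look-behind' => '((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2))',
);

my %compiled;    # name => 1 if the pattern compiled, 0 otherwise
for my $name (sort keys %pattern) {
    my $source = $pattern{$name};
    my $re     = eval { qr/$source/xms };
    my $error  = $@;
    $compiled{$name} = defined $re ? 1 : 0;
    if ($compiled{$name}) {
        print "$name: compiles\n";
    }
    else {
        print "$name: does not compile: $error";
    }
}
```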

So then the critical section of code becomes

my @runs;                                      
push @runs, [ $1, $-[1] ] while $sequence =~ m{
  ((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2))     
  }xmsg;
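
To back up "works for all my test cases", the two versions can be compared directly. This is a rough sketch under my own naming (runs_with_k, runs_with_lookbehind and $disagreements are hypothetical), using a subset of the sequences from the test run above; add your own sequences to @sequences.

```perl
use strict;
use warnings;
use 5.010;    # only the \K version needs Perl 5.10 or later

# Original approach: \G-or-transition prefix, discarded with \K.
sub runs_with_k {
    my ($seq) = @_;
    my @runs;
    push @runs, [ $2, $-[2] ] while $seq =~ m{
        (?: \G | (.) (?! \1) ) \K ( (.) \3{2,6} (?! \3) )
    }xmsg;
    return @runs;
}

# Revised approach: look-ahead nested inside a negative look-behind.
sub runs_with_lookbehind {
    my ($seq) = @_;
    my @runs;
    push @runs, [ $1, $-[1] ] while $seq =~ m{
        ( (.) (?<! (?= \2\2) ..) \2{2,6} (?! \2) )
    }xmsg;
    return @runs;
}

my @sequences = qw(
    A AAA AAAAAAAA AATTCCGG AAAAAAAATTTTTTTT
    TAAAT TAAAAAAAT
    ATTAAATTTTCCCCCGGGGGGAAAAAAATTTTTTTT
    ATTAAACTTTTGCCCCCAGGGGGGCAAAAAAATTTTTTTT
    ATTAAACCTTTTGGCCCCCAAGGGGGGCCAAAAAAATTTTTTTT
);

my $disagreements = 0;
for my $seq (@sequences) {
    # Flatten each result to a string such as "AAA:3 TTTT:6" for comparison.
    my $k  = join ' ', map { join ':', @$_ } runs_with_k($seq);
    my $lb = join ' ', map { join ':', @$_ } runs_with_lookbehind($seq);
    next if $k eq $lb;
    $disagreements++;
    print "$seq: \\K version gives [$k], look-behind version gives [$lb]\n";
}
print "disagreements: $disagreements\n";
```

If the two patterns really are equivalent on these inputs, the last line reports zero disagreements; any sequence on which they differ is printed with both results.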

Give a man a fish: <%-{-{-{-<

In reply to Re: find all repeating sequences by AnomalousMonk
in thread find all repeating sequences by charm