charm has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to find all single-base repeats (e.g. AAAA, TTT, GGGG, CCCC) of length 3 to 7 bp in a randomly generated sequence, along with their respective positions, using a regex. I have tried the following code:

if($seqR =~m/\G([A-Z]{3,7})+?$_/g) { #print "All repeated sequences $_\n"; };

It doesn't work.

Replies are listed 'Best First'.
Re: find all repeating sequences
by Marshall (Canon) on Dec 20, 2016 at 20:41 UTC
    I believe you want: a) the match and also b) the position of the match? Modifying kschwab's regex..
    #!/usr/bin/perl
    use strict;
    use warnings;

    my $string = 'AAAA TTT GGGG CCCC AAAAAA CCCC ABAB CCC TTT TTTT GGGEGEE';
    while ($string =~ m/\b((\w)\2{2,6})\b/g) {
        print "$-[1] $1 \n";
    }
    __END__
    #prints
    0 AAAA
    5 TTT
    9 GGGG
    14 CCCC
    19 AAAAAA
    26 CCCC
    36 CCC
    40 TTT
    44 TTTT
    Perlvar mentions @- and @+ although the explanation there is not completely clear.
    In Perl 5.6.0 the "@-" and "@+" dynamic arrays were introduced that supply the indices of successful matches. @- is the beginning and @+ is the ending. $-[0] and $+[0] correspond to entire pattern, while $-[N] and $+[N] correspond to the $N ($1, $2, etc.) submatches.
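As a rough illustration of the offsets described above, here is a minimal sketch. The sample string and the `%span` hash are made up for this example and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample string; character offsets: G0 G1 A2 A3 A4 A5 T6 T7
my $text = 'GGAAAATT';
my %span;

if ($text =~ /(A)(A+)/) {
    %span = (
        whole_start => $-[0],   # where the entire match begins
        whole_end   => $+[0],   # offset just past the end of the entire match
        g1_start    => $-[1],   # same pair of offsets for $1 ...
        g1_end      => $+[1],
        g2_start    => $-[2],   # ... and for $2
        g2_end      => $+[2],
    );

    # Because $+[N] points just past the end, end - start is the length.
    my $whole = substr($text, $-[0], $+[0] - $-[0]);
    printf "whole match '%s' spans %d..%d\n", $whole, $-[0], $+[0];
    printf "\$1 '%s' spans %d..%d\n", $1, $-[1], $+[1];
    printf "\$2 '%s' spans %d..%d\n", $2, $-[2], $+[2];
}
```

Note that `$+[N]` is the offset one past the last matched character, which is why `substr` with `end - start` recovers the matched text.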

      Wow. I never knew about @- and @+ .. that's really quite brilliant. Thanks!!!!

      Alex / talexb / Toronto

      Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Thank you! This is exactly what I needed
Re: find all repeating sequences
by kennethk (Abbot) on Dec 20, 2016 at 20:45 UTC
    Please read Markup in the Monastery. Note how your square brackets have been turned into a link. I see the bare </code> in your post, but that should have been written (I think) as
    <code> if($seqR =~m/\G([A-Z]{3,7})+?$_/g) { #print "All repeated sequences $_\n"; }; </code>

    This also sounds a bit like homework, and a bit like you are getting ahead of yourself. As discussed in How do I post a question effectively?, think about what you are trying to do and where you are hitting issues. Are you sure your randomly generated strings contain what you think they do?

    Where did you get the code you've posted? You have a \G in there, which does not really make sense in the context you're using it in. See Assertions in perlre for documentation. You are also interpolating $_ into your regex, which makes no sense to me, and going for a non-greedy match, which is strange as well (see Metacharacters). What was the source material for this construction? Have you read perlretut for a basic description of what regular expressions do?

    You probably want to use backreferences -- see Capture groups in perlre. Perhaps you confused \G and \g? You also probably want to use pos, $& and/or $-[0] in some combination to figure out positions, however you'd like to define that. So ending up with something like:

    while ($seqR =~ m/([ATGC])\g1{2,6}/g) {
        my $position = $-[0];    # start of the match; pos($seqR) would give the end
        print "$& found at $position\n";
    }

    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: find all repeating sequences
by kschwab (Vicar) on Dec 20, 2016 at 19:54 UTC

    If I'm understanding you correctly:

    #!/usr/bin/perl
    my @seqs=qw(AAAA ABAB CCC DDDDDDD EEEEFFF);
    foreach my $seq (@seqs) {
        if($seq =~m/^(\w)\1{2,6}$/) {
            print "$seq: REPEATED\n";
        } else {
            print "$seq: NOT REPEATED\n";
        }
    }

    Outputs:
    AAAA: REPEATED
    ABAB: NOT REPEATED
    CCC: REPEATED
    DDDDDDD: REPEATED
    EEEEFFF: NOT REPEATED

    But that only matches strings made up entirely of one repeated character, like AAAA; it will not find the runs inside something like AAAAABBB. Not sure exactly what behavior you're looking for.

    Edit: The magic here is backreferences and capture groups. See this bit in the perlre docs.
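If the goal is to find runs inside a longer string rather than test whole strings, one possible variation is to drop the ^ and $ anchors and loop with /g. This is only a sketch: the sub name `find_runs` and the sample input are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [offset, run] pairs for every run of
# 3 to 7 identical word characters found anywhere in $seq.
sub find_runs {
    my ($seq) = @_;
    my @runs;
    while ($seq =~ /((\w)\2{2,6})/g) {
        push @runs, [ $-[1], $1 ];   # $-[1] is where capture group 1 starts
    }
    return @runs;
}

my @runs = find_runs('AAAAABBBCD');
print "$_->[0] $_->[1]\n" for @runs;
```

One caveat with this simple form: a run longer than seven characters is not rejected but consumed seven at a time, so ten identical characters would be reported as a run of 7 followed by a run of 3.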
Re: find all repeating sequences
by AnomalousMonk (Archbishop) on Dec 21, 2016 at 02:13 UTC

    Your OP is a bit confusing (please see the comments of others) and I've made a few guesses and assumptions about what you want. In particular, I assume you're only dealing with  ATCG bases; if not, you'll have to replace all (I think) of the  . (dot) operators with  [ATCG] character classes. In addition, you'll need Perl version 5.10 or later for the  \K operator. Try this:

    Of course, the critical section of code is
    my @runs;
    push @runs, [ $2, $-[2] ]
        while $s =~ m{ (?: \G | (.) (?! \1)) \K ((.) \3{2,6} (?! \3)) }xmsg;
    where  $s is your sequence.

    Update: The somewhat roundabout regex
        (?: \G | (.) (?! \1)) \K
    in the foregoing is a positive look-behind that ensures that a sub-sequence of identical  ATCG bases indeed is preceded by a transition from one base to another (or is at the start of the main sequence/string). This complexity is necessary because the more straightforward negative look-behind of
         (?<! \2\2)
    in, say,
        ((.) (?<! \2\2) \2{2,6} (?! \2))
    will not compile: the  \2 backreference is seen as being variable-width even though its referent is the clearly fixed-width  (.) capture. I think the regex compiler simply cannot detect the inherent single-character nature of this capture and assumes variable-width as the worst-case scenario. (The  \K operator of Perl versions 5.10+ is effectively a variable width positive look-behind.)
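To check on your own Perl whether a given pattern compiles, something like the following sketch may help. The helper name `compiles` is hypothetical, and no result is claimed here for the look-behind pattern, since behavior may differ between Perl versions.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $pattern compiles under /x, else 0.
# Regex compilation errors raised at run time are fatal, so eval traps them.
sub compiles {
    my ($pattern) = @_;
    my $ok = eval {
        no warnings;                  # silence any experimental-feature warnings
        my $re = qr/$pattern/x;
        1;
    };
    return $ok ? 1 : 0;
}

# Single quotes keep the backslashes in \2 intact.
my $lookbehind = '((.) (?<! \2\2) \2{2,6} (?! \2))';
printf "backreference inside look-behind: %s\n",
    compiles($lookbehind) ? 'compiles' : 'does not compile';
```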

    However, it occurred to me to try another tack. All look-arounds are zero-width (i.e., fixed-width) assertions. Is a look-ahead (which can be variable-width) embedded in a look-behind along with other fixed-width operators accepted as collectively fixed-width? It turns out it is. The look-behind
        (?<! (?= \2\2) ..)
    in the regex
        ((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2))
    compiles and works just fine for all my test cases and is IMHO simpler and more maintainable than the original circumlocution. It's also pre-5.10 compatible (as far as I can test).

    So then the critical section of code becomes

    my @runs;
    push @runs, [ $1, $-[1] ]
        while $sequence =~ m{ ((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2)) }xmsg;
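For anyone who wants to try it, here is a small self-contained sketch that wraps the statement above in a sub and runs it on a made-up sequence. The sub name `base_runs` and the test sequence are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the regex above: returns [run, offset]
# pairs for runs of exactly 3 to 7 identical characters.
sub base_runs {
    my ($sequence) = @_;
    my @runs;
    push @runs, [ $1, $-[1] ]
        while $sequence =~ m{ ((.) (?<! (?= \2\2) ..) \2{2,6} (?! \2)) }xmsg;
    return @runs;
}

# Made-up sequence: AAA, TTTT, a single G, eight Cs, a single A, GGG.
my @found = base_runs('AAATTTTGCCCCCCCCAGGG');
printf "%-7s at offset %d\n", @$_ for @found;
```

With this input the run of eight Cs should be skipped entirely rather than reported as a shorter run, because the look-behind rejects any start position that is preceded by the same base and the trailing look-ahead rejects any run followed by the same base.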


    Give a man a fish:  <%-{-{-{-<