7stud has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I don't understand the following passage in perlretut:

       "\G" is also invaluable in processing fixed length records with
       regexps.  Suppose we have a snippet of coding region DNA, encoded as
       base pair letters "ATCGTTGAAT..." and we want to find all the stop
       codons "TGA".  In a coding region, codons are 3-letter sequences, so we
       can think of the DNA snippet as a sequence of 3-letter records.  The
       naive regexp

           # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
           $dna = "ATCGTTGAATGCAAATGACATGAC";
           $dna =~ /TGA/;

       doesn't work; it may match a "TGA", but there is no guarantee that the
       match is aligned with codon boundaries, e.g., the substring "GTT GAA"
       gives a match.  A better solution is

           while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
               print "Got a TGA stop codon at position ", pos $dna, "\n";
           }

       which prints

           Got a TGA stop codon at position 18
           Got a TGA stop codon at position 23

       Position 18 is good, but position 23 is bogus.  What happened?

       The answer is that our regexp works well until we get past the last
       real match.  Then the regexp will fail to match a synchronized "TGA"
       and start stepping ahead one character position at a time, not what we
       want.  The solution is to use "\G" to anchor the match to the codon
       alignment:

           while ($dna =~ /\G(\w\w\w)*?TGA/g) {
               print "Got a TGA stop codon at position ", pos $dna, "\n";
           }

       This prints

           Got a TGA stop codon at position 18

       which is the correct answer.  This example illustrates that it is
       important not only to match what is desired, but to reject what is not
       desired.

I don't understand why the \G is necessary, when the /g modifier is being used. Assuming no \G, it seems to me that once the first TGA is matched, then the pos() is here:

    TGA CAT GA
       ^

Note the spaces aren't actually part of the string. Next, the regex would look for 0 matches for (\w\w\w) followed by TGA, which is equivalent to just TGA. Clearly TGA is not present at pos(). Next, I would think the regex would look for (\w\w\w) one time, followed by TGA. That also does not occur at pos(). So how does that second TGA, which spans the space, match?

Re: \G and regexes
by moritz (Cardinal) on Apr 05, 2010 at 20:02 UTC
    If there is no anchoring, the regex can match anywhere inside the string. So given the string AAAATGA, the match can work as follows:

    AAAATGA
     XXXTGA

    where XXX stands for the letters matched by (\w\w\w). The first A in the string is not matched by anything. Using \G prevents that.
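
A minimal sketch of that point, for anyone who wants to check it: the helper name match_start is made up for illustration, and it relies on the standard @- array, whose first element holds the offset where the last successful match began. Since pos() is not yet set on the string, \G anchors at offset 0 here.

```perl
use strict;
use warnings;

# Hypothetical helper: return the offset where the first match of $re
# begins in $str, or undef if there is no match.
sub match_start {
    my ($str, $re) = @_;
    return $str =~ $re ? $-[0] : undef;
}

my $str = 'AAAATGA';

my $free     = match_start($str, qr/(\w\w\w)*?TGA/);
my $anchored = match_start($str, qr/\G(\w\w\w)*?TGA/);

# Without an anchor the engine is free to skip the first A.
print "unanchored: ", defined $free     ? "starts at $free"     : "no match", "\n";
# With \G the match must begin at offset 0, and no codon-aligned TGA exists.
print "with \\G:    ", defined $anchored ? "starts at $anchored" : "no match", "\n";
```

The unanchored pattern should report a start offset of 1, which is exactly the skipped first A, while the \G version should report no match.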

    Perl 6 - links to (nearly) everything that is Perl 6.

      Ok. After some tests, I get what you are saying.

      The /g flag does not require that further matching must start past the end of the previous match--it just makes a request for any other unique matches. And the regex /(\w\w\w)*?TGA/ can act like the regex /TGA/ because a regex will happily match nothing for *.

      But then why doesn't the /g flag cause this to match four times:

      use strict;
      use warnings;
      use 5.010;

      my $str = 'aaaBBBcccTGA';

      while ($str =~ /(?:\w\w\w)*?(TGA)/g) {
          say $1;
          say pos $str;
      }

      Aren't there four unique matches:

      1)  when (\w\w\w) is matched 0 times.
      2)  when (\w\w\w) is matched 1 time.
      3)  when (\w\w\w) is matched 2 times.
      4)  when (\w\w\w) is matched 3 times.

      That suggests that another match must end past the previous match--but that the next match doesn't have to start past the previous match.

      My tests also show that starting the regex with a \A to anchor it to the beginning of the string will cause the regex to match only once:

      use strict;
      use warnings;
      use 5.010;

      my $str = 'aaaBBBcccTGAddTGA';

      while ($str =~ /\A(?:\w\w\w)*?(TGA)/g) {
          say $1;
          say pos $str;
      }

      --output:--
      TGA
      12

      But then I would expect this to match twice, and it doesn't:

      use strict;
      use warnings;
      use 5.010;

      my $str = 'aaaBBBcccTGAdddTGA';

      while ($str =~ /\A(?:\w\w\w)*?(TGA)/g) {
          say $1;
          say pos $str;
      }

      So I guess I don't have any idea what's going on.

        The /g flag does not require that further matching must start past the end of the previous match--it just makes a request for any other unique matches.

        No. It makes a request for another match to the right of the end of the previous match. The first match goes up to position 12:

        aaaBBBcccTGAddTGA
        XXXXXXXXXTGA
                    ^ match ends here

        So the next match sees only "ddTGA" to match against.
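
To see this for both test strings from the post above, here is a small sketch. The helper name spans is invented for illustration; it records the start and end offset of every /g match using the standard @- and @+ arrays.

```perl
use strict;
use warnings;

# Hypothetical helper: return "start-end" offsets for every /g match
# of $re in $str, in the order the matches are found.
sub spans {
    my ($str, $re) = @_;
    my @spans;
    while ($str =~ /$re/g) {
        push @spans, "$-[0]-$+[0]";
    }
    return @spans;
}

my @patterns = (
    [ 'no anchor' => qr/(?:\w\w\w)*?(TGA)/   ],
    [ '\A'        => qr/\A(?:\w\w\w)*?(TGA)/ ],
    [ '\G'        => qr/\G(?:\w\w\w)*?(TGA)/ ],
);

for my $str ('aaaBBBcccTGAddTGA', 'aaaBBBcccTGAdddTGA') {
    print "$str\n";
    for my $p (@patterns) {
        my ($label, $re) = @$p;
        printf "  %-9s %s\n", $label, join(' ', spans($str, $re));
    }
}
```

For the first string this should give 0-12 and 14-17 without an anchor, and only 0-12 with either \A or \G. For the second string the unanchored and \G versions should both give 0-12 and 12-18, while \A still gives only 0-12: every attempt after the first begins at pos() or later, and \A can only succeed at offset 0.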
Re: \G and regexes
by Anonymous Monk on Apr 05, 2010 at 20:19 UTC
    use re 'debug';

    shows you what the re engine is doing

    perl -Mre=debug -le"print $1 while q!bbbbabcabc! =~ /\G(\w\w\w)*?abc/g"
    perl -Mre=debug -le"print $1 while q!bbbbabcabc! =~ /(\w\w\w)*?abc/g"

    In the string bbbbabcabc (bbb bab cab c), no run of whole 3-letter groups counted from the start is followed by abc. If you don't use \G, you will get a match anyway.
Re: \G and regexes
by choroba (Cardinal) on Apr 05, 2010 at 20:05 UTC
    The second TGA will match here:
    TGA CAT GAC
          ^ ^^

    Because without \G, the next match does not have to start where the last match ended, so it is not limited to three-character steps from there.
      I know that. The question is why.
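        Because the match may start at any position at or after pos(). The engine steps ahead to the T of CAT, where (\w\w\w)*? matches zero times and TGA matches TGA. The regexp does not say that matching must not start just anywhere in the string, and that's what \G does.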
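
A rough sketch that makes the stepping visible on the perlretut string. The sub name stop_matches and its hash keys are invented for illustration; prefix is simply the number of letters the (\w\w\w)*? part consumed before the TGA.

```perl
use strict;
use warnings;

# Hypothetical helper: for every /g match, record where it started,
# where it ended (the value pos() reports), and how many letters the
# (\w\w\w)*? prefix consumed before the TGA.
sub stop_matches {
    my ($dna, $use_G) = @_;
    my $re = $use_G ? qr/\G(?:\w\w\w)*?TGA/ : qr/(?:\w\w\w)*?TGA/;
    my @found;
    while ($dna =~ /$re/g) {
        push @found, {
            start  => $-[0],
            end    => $+[0],
            prefix => $+[0] - $-[0] - 3,
        };
    }
    return @found;
}

# expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
my $dna = 'ATCGTTGAATGCAAATGACATGAC';

for my $use_G (0, 1) {
    print $use_G ? "with \\G:\n" : "without \\G:\n";
    for my $m (stop_matches($dna, $use_G)) {
        print "  start=$m->{start} end=$m->{end} prefix=$m->{prefix}\n";
    }
}
```

Without \G the second match should show start=20, end=23 and prefix=0: the attempts at offsets 18 and 19 fail, so the engine moves its starting point to 20 and matches the bare TGA there. With \G only the first match (start=0, end=18, prefix=15) should remain.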
Re: \G and regexes
by biohisham (Priest) on Apr 06, 2010 at 08:58 UTC
    Suppose we have a snippet of coding region DNA, encoded as base pair letters "ATCGTTGAAT..."
    Interesting and excellent example. A base is a letter and a codon is 3 bases long, but the snippet above shows single bases, not base pairs, where every base and its opposing base are each counted as a letter:

    # 10 base pairs
    ATCGTTGAAT
    TAGCAACTTA

    # 10 bases
    TAGCAACTTA

    How can we rectify that to maintain relevance to both Perl and biology?


    Excellence is an Endeavor of Persistence. Chance Favors a Prepared Mind.
Re: \G and regexes
by Anonymous Monk on Apr 06, 2010 at 19:33 UTC
    The lack of any sort of anchor in the original pattern allows for any number of characters to occur before the pattern, so you are not guaranteed to fall on a codon boundary.

    XTGAXX would be a valid match for /(\w\w\w)*?TGA/. The simple solution is to anchor that pattern to the start of the string:

    /^(\w\w\w)*?TGA/g
    The downside is the potential to be very slow, depending on how the regex engine handles the global match.

    \G forces the next match to start exactly where the previous one ended, so the engine never has to rescan from the start of the string or try unaligned starting positions.
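
One caveat worth checking when choosing between the two anchors: under /g, a pattern anchored with ^ (and no /m) can only succeed on the first iteration, because later attempts begin at pos() and ^ matches only at offset 0. A small sketch, assuming a made-up test string with two in-frame stop codons and an illustrative helper called stop_positions:

```perl
use strict;
use warnings;

# Hypothetical helper: return the value of pos() after each /g match
# of $re against $dna.
sub stop_positions {
    my ($dna, $re) = @_;
    my @pos;
    while ($dna =~ /$re/g) {
        push @pos, pos($dna);
    }
    return @pos;
}

# Made-up example; expanded, this is "ATC TGA GGG TGA"
my $dna = 'ATCTGAGGGTGA';

print "^  : @{[ stop_positions($dna, qr/^(?:\w\w\w)*?TGA/)  ]}\n";
print "\\G : @{[ stop_positions($dna, qr/\G(?:\w\w\w)*?TGA/) ]}\n";
```

The ^ version should report only position 6, while the \G version should report 6 and 12, so \G is the anchor that keeps codon alignment across repeated matches.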

    --Greg