Dear Monks,

I don't understand the following passage in perlretut:

       "\G" is also invaluable in processing fixed length records with
       regexps.  Suppose we have a snippet of coding region DNA, encoded as
       base pair letters "ATCGTTGAAT..." and we want to find all the stop
       codons "TGA".  In a coding region, codons are 3-letter sequences, so we
       can think of the DNA snippet as a sequence of 3-letter records.  The
       naive regexp

           # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
           $dna = "ATCGTTGAATGCAAATGACATGAC";
           $dna =~ /TGA/;

       doesn't work; it may match a "TGA", but there is no guarantee that the
       match is aligned with codon boundaries, e.g., the substring "GTT GAA"
       gives a match.  A better solution is

           while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
               print "Got a TGA stop codon at position ", pos $dna, "\n";
           }

       which prints

           Got a TGA stop codon at position 18
           Got a TGA stop codon at position 23

       Position 18 is good, but position 23 is bogus.  What happened?

       The answer is that our regexp works well until we get past the last
       real match.  Then the regexp will fail to match a synchronized "TGA"
       and start stepping ahead one character position at a time, not what we
       want.  The solution is to use "\G" to anchor the match to the codon
       alignment:

           while ($dna =~ /\G(\w\w\w)*?TGA/g) {
               print "Got a TGA stop codon at position ", pos $dna, "\n";
           }

       This prints

           Got a TGA stop codon at position 18

       which is the correct answer.  This example illustrates that it is
       important not only to match what is desired, but to reject what is not
       desired.

I don't understand why \G is necessary when the /g modifier is already being used. Without \G, it seems to me that once the first TGA is matched, pos() is here:

TGA CAT GAC
    ^

Note that the spaces aren't actually part of the string. Next, the regex would try zero repetitions of (\w\w\w) followed by TGA, which is equivalent to just TGA. Clearly TGA is not present at pos(). Then I would expect the regex to try (\w\w\w) once, followed by TGA. That does not occur at pos() either. So how does the second TGA, which spans the space between CAT and GAC, match?
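To make the question concrete, here is a small sketch that records where each match starts as well as where it ends, using the built-in @- and @+ arrays. The sub names spans_without_G and spans_with_G are my own and are not from perlretut; the string and the two patterns are the ones quoted above.

```perl
use strict;
use warnings;

# Collect [start, end] offsets for every /g match of the pattern WITHOUT \G.
# $-[0] is the offset where the whole match began; $+[0] is where it ended
# (the same value that pos() reports after a successful /g match).
sub spans_without_G {
    my ($dna) = @_;
    my @spans;
    while ($dna =~ /(\w\w\w)*?TGA/g) {
        push @spans, [ $-[0], $+[0] ];
    }
    return @spans;
}

# Same loop, but with \G, so a match may only begin exactly at pos().
sub spans_with_G {
    my ($dna) = @_;
    my @spans;
    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        push @spans, [ $-[0], $+[0] ];
    }
    return @spans;
}

my $dna = "ATCGTTGAATGCAAATGACATGAC";

print "without \\G: ", join(", ", map { join("-", @$_) } spans_without_G($dna)), "\n";
print "with \\G:    ", join(", ", map { join("-", @$_) } spans_with_G($dna)), "\n";
```

If I have traced it correctly, this prints 0-18, 20-23 for the first loop and only 0-18 for the second. So the second match in the unanchored loop does not begin at pos() = 18 at all: it begins at offset 20, two characters further on, which is the "stepping ahead one character position at a time" that the quoted passage describes.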


In reply to \G and regexes by 7stud
