Dear Monks,
I don't understand the following passage in perlretut:
"\G" is also invaluable in processing fixed length records with
regexps. Suppose we have a snippet of coding region DNA, encoded as
base pair letters "ATCGTTGAAT..." and we want to find all the stop
codons "TGA". In a coding region, codons are 3-letter sequences, so we
can think of the DNA snippet as a sequence of 3-letter records. The
naive regexp
    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;
doesn't work; it may match a "TGA", but there is no guarantee that the
match is aligned with codon boundaries, e.g., the substring "GTT GAA"
gives a match. A better solution is
    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }
which prints
    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23
Position 18 is good, but position 23 is bogus. What happened?
The answer is that our regexp works well until we get past the last
real match. Then the regexp will fail to match a synchronized "TGA"
and start stepping ahead one character position at a time, not what we
want. The solution is to use "\G" to anchor the match to the codon
alignment:
    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }
This prints
    Got a TGA stop codon at position 18
which is the correct answer. This example illustrates that it is
important not only to match what is desired, but to reject what is not
desired.
I don't understand why \G is necessary when the /g modifier is already being used. Without \G, it seems to me that once the first TGA is matched, pos() is here:

    TGA CAT GAC
       ^

Note that the spaces aren't actually part of the string. Next, the regex would look for zero repetitions of (\w\w\w) followed by TGA, which is equivalent to just TGA. Clearly TGA is not present at pos(). After that, I would think the regex would look for one repetition of (\w\w\w) followed by TGA. That also does not occur at pos(). So how does that second TGA, which spans the space, match?
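
One way to probe this is to print where each match starts as well as where it ends. The sketch below is only an illustration: stop_codon_spans is a hypothetical helper name (not something from perlretut), and it relies on $-[0], which holds the start offset of the last successful match. Without \G, a failed attempt at pos() does not end the search; the engine retries from the next character, and the next, so the second match can begin off the codon grid.

```perl
use strict;
use warnings;

# Hypothetical helper (not from perlretut): returns a list of
# [start, end] offset pairs, one per successful match.
sub stop_codon_spans {
    my ($dna, $anchored) = @_;
    my @spans;
    if ($anchored) {
        # \G pins each attempt to pos($dna); no sliding forward on failure.
        while ($dna =~ /\G(\w\w\w)*?TGA/g) {
            push @spans, [ $-[0], pos($dna) ];
        }
    }
    else {
        # Without \G, a failed attempt at pos($dna) is retried one
        # character further along, so a match may start mid-codon.
        while ($dna =~ /(\w\w\w)*?TGA/g) {
            push @spans, [ $-[0], pos($dna) ];
        }
    }
    return @spans;
}

my $dna = "ATCGTTGAATGCAAATGACATGAC";

print "without \\G:\n";
printf "  match starts at %d, ends at %d\n", @$_ for stop_codon_spans($dna, 0);

print "with \\G:\n";
printf "  match starts at %d, ends at %d\n", @$_ for stop_codon_spans($dna, 1);
```

For the string above, the unanchored loop reports a second match that starts at offset 20 and ends at 23: the attempts starting at 18 and 19 fail, and the attempt starting at 20 finds "TGA" with zero repetitions of (\w\w\w). The anchored loop reports only the match ending at 18.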