\G and regexes

7stud has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I don't understand the following passage in perlretut:

"\G" is also invaluable in processing fixed length records with regexps. Suppose we have a snippet of coding region DNA, encoded as base pair letters "ATCGTTGAAT..." and we want to find all the stop codons "TGA". In a coding region, codons are 3-letter sequences, so we can think of the DNA snippet as a sequence of 3-letter records. The naive regexp # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC" $dna = "ATCGTTGAATGCAAATGACATGAC"; $dna =~ /TGA/; doesn't work; it may match a "TGA", but there is no guarantee that the match is aligned with codon boundaries, e.g., the substring "GTT GAA" gives a match. A better solution is while ($dna =~ /(\w\w\w)*?TGA/g) { # note the minimal *? print "Got a TGA stop codon at position ", pos $dna, "\n"; } which prints Got a TGA stop codon at position 18 Got a TGA stop codon at position 23 Position 18 is good, but position 23 is bogus. What happened? The answer is that our regexp works well until we get past the last real match. Then the regexp will fail to match a synchronized "TGA" and start stepping ahead one character position at a time, not what we want. The solution is to use "\G" to anchor the match to the codon alignment: while ($dna =~ /\G(\w\w\w)*?TGA/g) { print "Got a TGA stop codon at position ", pos $dna, "\n"; } This prints Got a TGA stop codon at position 18 which is the correct answer. This example illustrates that it is important not only to match what is desired, but to reject what is not desired.

I don't understand why the \G is necessary, when the /g modifier is being used. Assuming no \G, it seems to me that once the first TGA is matched, then the pos() is here:

TGA CAT GA
    ^

Note the spaces aren't actually part of the string. Next, the regex would look for 0 matches for (\w\w\w) followed by TGA, which is equivalent to just TGA. Clearly TGA is not present at pos(). Next, I would think the regex would look for (\w\w\w) one time, followed by TGA. That also does not occur at pos(). So how does that second TGA, which spans the space, match?
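One way to see where the second match comes from is to record where each match starts as well as where it ends. The following is only a sketch, not part of perlretut: the helper name `match_spans` is made up for illustration, and it relies on the built-in `@-` array, whose first element holds the start offset of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end] offsets for every /g match.
sub match_spans {
    my ($str, $re) = @_;
    my @spans;
    while ($str =~ /$re/g) {
        # $-[0] is where the whole match began; pos() is where it ended.
        push @spans, [ $-[0], pos($str) ];
    }
    return @spans;
}

my $dna = "ATCGTTGAATGCAAATGACATGAC";

my @unanchored = match_spans($dna, qr/(\w\w\w)*?TGA/);
my @anchored   = match_spans($dna, qr/\G(\w\w\w)*?TGA/);

print "without \\G: ", join(", ", map { "$_->[0]-$_->[1]" } @unanchored), "\n";
print "with \\G:    ", join(", ", map { "$_->[0]-$_->[1]" } @anchored), "\n";
```

Without `\G` the spans are 0-18 and 20-23. The second match does not start at offset 18: after failing at 18 and 19, the engine retries one character further on, where the group matches zero times and TGA matches immediately. With `\G` only 0-18 is reported.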


Replies are listed 'Best First'.
Re: \G and regexes by moritz (Cardinal) on Apr 05, 2010 at 20:02 UTC
If there is no anchoring, the regex can match anywhere inside the string. So given the string AAAATGA, the match can work as follows:

    AAAATGA
     XXX
        TGA

where XXX stands for the letters matched by `(\w\w\w)`. The first A in the string is not matched by anything. Using `\G` prevents that.
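The AAAATGA case can be checked directly. This is a minimal sketch with illustrative variable names; it assumes the group is mandatory, i.e. the pattern is `(\w\w\w)TGA`, as in the diagram above.

```perl
use strict;
use warnings;

my $str = 'AAAATGA';

# Unanchored: the engine may skip the first character before matching.
my $unanchored_start = $str =~ /(\w\w\w)TGA/ ? $-[0] : undef;

# Anchored with \G: pos($str) is unset, so \G means offset 0 and the
# engine is not allowed to try any later starting point.
my $anchored_matched = $str =~ /\G(\w\w\w)TGA/ ? 1 : 0;

print "unanchored match starts at offset $unanchored_start\n";
print "anchored match ", ($anchored_matched ? "succeeded" : "failed"), "\n";
```

The unanchored match starts at offset 1, leaving the first A unmatched, while the `\G` version fails because AAA at offset 0 is followed by ATG rather than TGA.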
Re^2: \G and regexes by 7stud (Deacon) on Apr 05, 2010 at 20:37 UTC
Ok. After some tests, I get what you are saying. The /g flag does not require that further matching must start past the end of the previous match--it just makes a request for any other unique matches. And the regex /(\w\w\w)?TGA/ can act like the regex /TGA/ because a regex will happily match nothing for (\w\w\w)?. But then why doesn't the /g flag cause this to match four times:

    use strict;
    use warnings;
    use 5.010;

    my $str = 'aaaBBBcccTGA';

    while ($str =~ /(?:\w\w\w)?(TGA)/g) {
        say $1;
        say pos $str;
    }

Aren't there four unique matches:

1) when (\w\w\w) is matched 0 times.
2) when (\w\w\w) is matched 1 time.
3) when (\w\w\w) is matched 2 times.
4) when (\w\w\w) is matched 3 times.

That suggests that another match must end past the previous match--but that the next match doesn't have to start past the previous match. My tests also show that starting the regex with a \A to anchor it to the beginning of the string will cause the regex to match only once:

    use strict;
    use warnings;
    use 5.010;

    my $str = 'aaaBBBcccTGAddTGA';

    while ($str =~ /\A(?:\w\w\w)*?(TGA)/g) {
        say $1;
        say pos $str;
    }

    --output:--
    TGA
    12

But then I would expect this to match twice, and it doesn't:

    use strict;
    use warnings;
    use 5.010;

    my $str = 'aaaBBBcccTGAdddTGA';

    while ($str =~ /\A(?:\w\w\w)*?(TGA)/g) {
        say $1;
        say pos $str;
    }

So I guess I don't have any idea what's going on.
Re^3: \G and regexes by moritz (Cardinal) on Apr 05, 2010 at 20:54 UTC
"The /g flag does not require that further matching must start past the end of the previous match--it just makes a request for any other unique matches."

No. It makes a request for another match to the right of the end of the previous match. The first match goes up to position 12:

    aaaBBBcccTGAddTGA
    XXXXXXXXX
             TGA
                ^ match ends here

So the next match sees only `ddTGA` to match against.
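The effect of resuming at the end of the previous match can be seen by swapping `\A` for `\G` on the three-d string from the question. This is a sketch; the helper name `end_positions` is invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: collect pos() after every successful /g match.
sub end_positions {
    my ($str, $re) = @_;
    my @ends;
    while ($str =~ /$re/g) {
        push @ends, pos($str);
    }
    return @ends;
}

my $str = 'aaaBBBcccTGAdddTGA';

# \A only matches at offset 0, so the second attempt (from pos 12) fails.
my @with_A = end_positions($str, qr/\A(?:\w\w\w)*?(TGA)/);

# \G matches wherever the previous match ended, so the scan continues.
my @with_G = end_positions($str, qr/\G(?:\w\w\w)*?(TGA)/);

print "\\A: @with_A\n";
print "\\G: @with_G\n";
```

With `\A` only position 12 is reported, because the second attempt begins at offset 12 and `\A` can never be satisfied there. With `\G` both 12 and 18 are reported.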
Re^4: \G and regexes by 7stud (Deacon) on Apr 06, 2010 at 04:03 UTC
Re^5: \G and regexes by moritz (Cardinal) on Apr 06, 2010 at 06:53 UTC
Re: \G and regexes by Anonymous Monk on Apr 05, 2010 at 20:19 UTC
`use re 'debug';` shows you what the regex engine is doing:

    perl -Mre=debug -le"print $1 while q!bbbbabcabc! =~ /\G(\w\w\w)?abc/g"
    perl -Mre=debug -le"print $1 while q!bbbbabcabc! =~ /(\w\w\w)?abc/g"

In the string bbbbabcabc (bbb bab cab c), no 3-letter group boundary is followed by abc, so the \G version finds nothing. If you don't use \G you will get a match.
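The same comparison can be written as a script, which avoids shell quoting differences (the one-liners above use double quotes, which a Unix shell would interpolate). This is a sketch with illustrative variable names; the `re` pragma is lexically scoped and writes its trace to STDERR.

```perl
use strict;
use warnings;

my $str = 'bbbbabcabc';
my (@with_G, @without_G);

{
    # Only patterns compiled inside this block are traced.
    use re 'debug';

    # Record where each match ends; a failed /g match resets pos().
    while ($str =~ /\G(\w\w\w)?abc/g) { push @with_G,    pos($str) }
    while ($str =~ /(\w\w\w)?abc/g)   { push @without_G, pos($str) }
}

print "with \\G: ",    scalar(@with_G),    " match(es)\n";
print "without \\G: ", scalar(@without_G), " match(es)\n";
```

The `\G` loop finds no match. The unanchored loop finds two: one starting at offset 1 (bbb followed by abc) and one at offset 7, where the optional group matches nothing.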
Re: \G and regexes by choroba (Cardinal) on Apr 05, 2010 at 20:05 UTC
The second TGA will match here:

    TGA CAT GAC
          ^ ^^

Because without \G, the regex is not forced to advance in three-character steps from the end of the last match.
Re^2: \G and regexes by 7stud (Deacon) on Apr 05, 2010 at 20:26 UTC
I know that. The question is why.
Re^3: \G and regexes by choroba (Cardinal) on Apr 05, 2010 at 20:32 UTC
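Because the (\w\w\w)*? part is allowed to match zero times, and nothing stops the engine from retrying at a later start position: after failing at offsets 18 and 19, it finds TGA at offset 20. The regexp does not say that a match must begin where the previous one ended, and that is what \G adds.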
Re^4: \G and regexes by choroba (Cardinal) on Apr 05, 2010 at 22:53 UTC
Re: \G and regexes by biohisham (Priest) on Apr 06, 2010 at 08:58 UTC
"Suppose we have a snippet of coding region DNA, encoded as base pair letters "ATCGTTGAAT...""

Interesting and excellent example. A base is a letter and a codon is 3 bases long, but the snippet above denotes single base letters, not paired base letters (where every opposing base is counted as a letter):

    # 10 base pairs
    ATCGTTGAAT
    TAGCAACTTA
    # 10 bases
    TAGCAACTTA

How can we rectify that to maintain relevance to both Perl and biology?
Re: \G and regexes by Anonymous Monk on Apr 06, 2010 at 19:33 UTC
The lack of any sort of anchor in the original pattern allows any number of characters to occur before the pattern, so you are not guaranteed to fall on a codon boundary. `XTGAXX` would be a valid match for `/(\w\w\w)?TGA/`. The simple solution is to anchor that pattern to the start of the string: `/^(\w\w\w)?TGA/g`. The downside is the potential to be very slow, depending on how the regex engine handles the global match. `\G` forces the next match to start exactly where the previous one ended, preventing any expensive backtracking. --Greg