Dear Monks,

I don't understand the following passage in perlretut:

"\G" is also invaluable in processing fixed length records with regexps. Suppose we have a snippet of coding region DNA, encoded as base pair letters "ATCGTTGAAT..." and we want to find all the stop codons "TGA". In a coding region, codons are 3-letter sequences, so we can think of the DNA snippet as a sequence of 3-letter records. The naive regexp

    # expanded, this is "ATC GTT GAA TGC AAA TGA CAT GAC"
    $dna = "ATCGTTGAATGCAAATGACATGAC";
    $dna =~ /TGA/;

doesn't work; it may match a "TGA", but there is no guarantee that the match is aligned with codon boundaries, e.g., the substring "GTT GAA" gives a match. A better solution is

    while ($dna =~ /(\w\w\w)*?TGA/g) {  # note the minimal *?
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

which prints

    Got a TGA stop codon at position 18
    Got a TGA stop codon at position 23

Position 18 is good, but position 23 is bogus. What happened?

The answer is that our regexp works well until we get past the last real match. Then the regexp will fail to match a synchronized "TGA" and start stepping ahead one character position at a time, not what we want. The solution is to use "\G" to anchor the match to the codon alignment:

    while ($dna =~ /\G(\w\w\w)*?TGA/g) {
        print "Got a TGA stop codon at position ", pos $dna, "\n";
    }

This prints

    Got a TGA stop codon at position 18

which is the correct answer. This example illustrates that it is important not only to match what is desired, but to reject what is not desired.

I don't understand why the \G is necessary, when the /g modifier is being used. Assuming no \G, it seems to me that once the first TGA is matched, then the pos() is here:

TGA CAT GA
    ^

Note the spaces aren't actually part of the string. Next, the regex would look for 0 matches for (\w\w\w) followed by TGA, which is equivalent to just TGA. Clearly TGA is not present at pos(). Next, I would think the regex would look for (\w\w\w) one time, followed by TGA. That also does not occur at pos(). So how does that second TGA, which spans the space, match?
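One way to probe this is to print where each match starts, not just where it ends. The sketch below is not from perlretut; match_spans is a made-up helper, and it relies on the built-in @- and @+ arrays, where $-[0] is the start offset of the last successful match and $+[0] is its end offset (the same value pos reports). It suggests that /g only tells the engine where to begin searching; when both attempts at offset 18 fail, the engine is still free to retry from 19, then 20, unless \G pins the attempt to pos().

```perl
use strict;
use warnings;

# Same string as in the perlretut passage.
my $dna = "ATCGTTGAATGCAAATGACATGAC";

# Hypothetical helper: collect [start, end] offsets of every /g match.
sub match_spans {
    my ($re) = @_;
    my @spans;
    pos($dna) = undef;                  # scan from the beginning each time
    while ($dna =~ /$re/g) {
        push @spans, [ $-[0], $+[0] ];  # $-[0] = match start, $+[0] = match end
    }
    return @spans;
}

for my $re (qr/(\w\w\w)*?TGA/, qr/\G(\w\w\w)*?TGA/) {
    print "pattern: $re\n";
    for my $span (match_spans($re)) {
        my ($start, $end) = @$span;
        print "  match starts at offset $start, ends at offset $end\n";
    }
}
```

Without \G, the second match starts at offset 20 rather than 18: the engine gave up at 18 and 19, then found "TGA" straddling the "CAT GAC" codon boundary with zero repetitions of (\w\w\w). With \G, the attempt at 18 is the only one allowed, so the loop ends after the first match.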

In reply to \G and regexes by 7stud
