Re^4: Understanding a portion of perlretut


Re^5: Understanding a portion of perlretut
by BlueStarry (Novice) on Dec 09, 2015 at 16:15 UTC

ATCGTT = (\w\w\w)*?

Re^6: Understanding a portion of perlretut

by AnomalousMonk (Archbishop) on Dec 09, 2015 at 22:04 UTC

Here's another way to look at things: instrument the regex with (?{ code }) print points (see Extended Patterns) and learn by experimentation. I'm also taking the liberty of introducing some other new constructs: the (?:pattern) non-capturing grouping (also see Extended Patterns); the /x regex modifier (all of the preceding are documented in perlre); and the @- (aka @LAST_MATCH_START) special array variable (see perlvar).

First look at TGA matching against a simplified string without a \G anchor. Note that in contrast to some other code examples in this thread, the beginning offset of a match is reported.
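For readers new to @-, here is a minimal sketch of what the examples in this post report. The helper name offsets is made up for illustration; $-[0] is the offset where the whole match starts and $-[1] is the offset where the first capture group starts.

```perl
use strict;
use warnings;

# Hypothetical helper: for each global match, record where the whole
# match starts ($-[0]) and where capture group 1 starts ($-[1]).
sub offsets {
    my ($s) = @_;
    my @found;
    while ($s =~ m{ (?: \w\w\w )*? (TGA) }xmsg) {
        push @found, [ $-[0], $-[1] ];
    }
    return @found;
}

printf "whole match starts at %d, TGA starts at %d\n", @$_
    for offsets('XXXxxxTGAxxTGAxxxxxxx');
# whole match starts at 0, TGA starts at 6
# whole match starts at 11, TGA starts at 11
```

The non-capturing (?: ... ) group keeps the skipped triplets out of the capture numbering, so (TGA) is always group 1.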

c:\@Work\Perl>perl -wMstrict -le
"my $s = 'XXXxxxTGAxxTGAxxxxxxx';
 while ($s =~ m{ (?{ print qq{trying a match at offset }, pos $s }) (?: \w\w\w)*? (TGA) }xmsg) {
     print qq{matched TGA beginning at offset $-[1]};
     }
"
trying a match at offset 0
matched TGA beginning at offset 6
trying a match at offset 9
trying a match at offset 10
trying a match at offset 11
matched TGA beginning at offset 11

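The second TGA match, at offset 11, is spurious: it does not lie a whole number of three-character groups past the end of the previous match (offset 9).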

Now consider the effect of adding a \G anchor assertion.

c:\@Work\Perl>perl -wMstrict -le
"my $s = 'XXXxxxTGAxxTGAxxxxxxx';
 while ($s =~ m{ \G (?{ print qq{trying a match at offset }, pos $s }) (?: \w\w\w)*? (TGA) }xmsg) {
     print qq{matched TGA beginning at offset $-[1]};
     }
"
trying a match at offset 0
matched TGA beginning at offset 6
trying a match at offset 9

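With \G, the second match attempt can begin only at the offset immediately beyond the point at which the previous successful match ended, so offsets 10 and 11 are never tried and the spurious TGA is not matched.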
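The same comparison can be written as a small script rather than two one-liners. This is only a sketch; the two helper names are hypothetical, and each simply collects $-[1] for every successful global match.

```perl
use strict;
use warnings;

# Hypothetical helper: TGA start offsets found without a \G anchor.
sub tga_offsets_unanchored {
    my ($s) = @_;
    my @starts;
    while ($s =~ m{ (?: \w\w\w )*? (TGA) }xmsg) {
        push @starts, $-[1];
    }
    return @starts;
}

# Hypothetical helper: the same pattern, but each attempt must begin
# exactly where the previous successful match ended (\G).
sub tga_offsets_anchored {
    my ($s) = @_;
    my @starts;
    while ($s =~ m{ \G (?: \w\w\w )*? (TGA) }xmsg) {
        push @starts, $-[1];
    }
    return @starts;
}

my $s = 'XXXxxxTGAxxTGAxxxxxxx';
print "without \\G: @{[ tga_offsets_unanchored($s) ]}\n";   # without \G: 6 11
print "with \\G:    @{[ tga_offsets_anchored($s) ]}\n";     # with \G:    6
```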

Supplemental: We just got finished saying that in

my $s = 'XXXxxxTGAxxTGAxxxxxxx';
while ($s =~ m{ (?{ print qq{trying a match at offset }, pos $s }) (?: \w\w\w)*? (TGA) }xmsg) {
  print qq{matched TGA beginning at offset $-[1]};
  }

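the second TGA match is spurious and that \G prevents it. But consider a slightly different string, again without \G: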

c:\@Work\Perl>perl -wMstrict -le
"my $s = 'XXXxxxTGAxxTGAxxxxTGAxx';
 while ($s =~ m{ (?{ print qq{trying a match at offset }, pos $s }) (?: \w\w\w)*? (TGA) }xmsg) {
   print qq{matched TGA beginning at offset $-[1]};
   }
"
trying a match at offset 0
matched TGA beginning at offset 6
trying a match at offset 9
matched TGA beginning at offset 18

There is no \G here, yet this time the regex seems to miss the TGA at offset 11. Why?
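One way to explore the question is to emulate by hand what (?: \w\w\w)*? (TGA) does from a single starting offset: try zero leading triplets, then one, then two, and stop at the first count that puts TGA next. The sketch below is not how the regex engine is implemented; first_tga_from is a hypothetical helper, and it assumes the string contains only \w characters, as the test strings here do.

```perl
use strict;
use warnings;

# Hypothetical helper: starting at $start, skip 0, 1, 2, ... whole
# three-character groups and return the offset of the first TGA that
# lines up, or nothing if no count of groups works from this start.
# Assumes every character of $s matches \w.
sub first_tga_from {
    my ($s, $start) = @_;
    for (my $n = 0; $start + 3 * $n + 3 <= length $s; $n++) {
        my $at = $start + 3 * $n;
        return $at if substr($s, $at, 3) eq 'TGA';
    }
    return;
}

my $s = 'XXXxxxTGAxxTGAxxxxTGAxx';
print first_tga_from($s, 9), "\n";   # 18
```

From offset 9 the candidates are offsets 9, 12, 15 and 18, so offset 11 is never examined. Because an attempt starting at 9 succeeds, no attempt is ever started at 10 or 11. In the shorter string 'XXXxxxTGAxxTGAxxxxxxx', nothing lines up from 9 or 10, which is why the attempt at 11 happens there.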

Give a man a fish: <%-{-{-{-<


Re^7: Understanding a portion of perlretut

by Athanasius (Archbishop) on Dec 10, 2015 at 09:54 UTC

Hello AnomalousMonk, and thanks for the writeup.

Is your Supplemental question meant rhetorically? Because I, for one, would really like to know the answer! Consider the following:

19:40 >perl -wE "my $s = 'abCdefC'; while ($s =~ / (f)*? C /gx) { say qq[match: $1, pos = ], pos $s; }"
Use of uninitialized value $1 in concatenation (.) or string at -e line 1.
match: , pos = 3
match: f, pos = 7

19:41 >

In this case the regex engine does not say: “Well, matching 1 f is better than matching none, so I’ll match the fC sequence first.” On the contrary, it first matches the C preceded by zero fs, as one might reasonably expect from the quantifier, *?, which says, “match this zero or more times in a non-greedy way.” I actually don’t understand how this can ever, logically, match with more than zero, since zero is possible and less greedy than 1??

But back to the OP, how does $dna =~ / (\w\w\w)*? TGA /gx differ logically from $s =~ / (f)*? C /gx? I can’t see a fundamental distinction here, yet there must be one because the former finds a non-empty match for (\w\w\w) before it looks for an empty one!
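A way to probe the comparison is to print where each match starts as well as where it ends. The following is a sketch only; spans is a hypothetical helper, and 'XXXxxxTGA' is a shortened stand-in for the DNA string.

```perl
use strict;
use warnings;

# Hypothetical helper: return a "start-end" span for each global match,
# using $-[0] (match start) and $+[0] (match end).
sub spans {
    my ($s, $re) = @_;
    my @spans;
    while ($s =~ /$re/g) {
        push @spans, join '-', $-[0], $+[0];
    }
    return @spans;
}

print join(' ', spans('abCdefC',   qr/ (f)*? C /x)), "\n";          # 2-3 5-7
print join(' ', spans('XXXxxxTGA', qr/ (\w\w\w)*? TGA /x)), "\n";   # 0-9
```

In both cases the match starts at the leftmost offset from which any match is possible, and *? then takes the fewest repetitions that work from that start. For (f)*? C the leftmost workable start is offset 2, where zero repetitions suffice; the second match starts at offset 5, where zero fails and one f is needed. For (\w\w\w)*? TGA offset 0 is already workable, because \w\w\w can consume the leading characters, so the engine settles on two triplets there and never gets to try a later start with zero.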

Thanks,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,