PerlMonks  

Re^4: Understanding a portion of perlretut

by Corion (Patriarch)
on Dec 09, 2015 at 15:55 UTC ( [id://1149787]=note )


in reply to Re^3: Understanding a portion of perlretut
in thread Understanding a portion on the Perlretut

Step 2 is to try the leftmost part 1 time (no TGA found).

Step 3 is to try the leftmost part 2 times (no TGA found).

Step 4 (and this is where the naive part goes bad) is to advance the leftmost starting point by one, since the match is unanchored.

Re^5: Understanding a portion of perlretut
by BlueStarry (Novice) on Dec 09, 2015 at 16:15 UTC
    Can you elaborate, please? I can't follow you, because following your steps, the short string wouldn't match. Trying the leftmost part 2 times means ATCGTT = (\w\w\w)*?, OK? Two times. But I cannot understand how it is possible that it matches on CGTTGA.
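The steps above can be sketched by hand, which may make the CGTTGA result less surprising. This is only an emulation of the search order, not how the regex engine is implemented. `emulate_naive` is a hypothetical helper, the short string is assumed to be 'ATCGTTGA', and every character is assumed to be a word character so that `\w\w\w` always succeeds when three characters remain.

```perl
use strict;
use warnings;

# Hand-rolled emulation of the search order of an unanchored
# /(\w\w\w)*?TGA/. Assumes every character of $s matches \w.
sub emulate_naive {
    my ($s) = @_;
    my $len = length $s;
    for my $start (0 .. $len - 3) {            # step 4: bump the start by one
        # steps 1..3: try 0, 1, 2, ... triplets from this start
        for (my $n = 0; $start + 3 * $n + 3 <= $len; $n++) {
            my $at = $start + 3 * $n;          # where TGA would have to sit
            return ($start, $n) if substr($s, $at, 3) eq 'TGA';
        }
    }
    return;                                    # no match from any start
}

my $short = 'ATCGTTGA';                        # assumed short string
my ($start, $triplets) = emulate_naive($short);
print "start $start, triplets $triplets\n";    # start 2, triplets 1

# Cross-check against the real regex.
$short =~ /(\w\w\w)*?TGA/ or die 'no match';
print "real regex matched '$&' starting at offset $-[0]\n";   # 'CGTTGA', offset 2
```

From offsets 0 and 1 no triplet count lines up with a TGA, so the start is bumped to offset 2, where one triplet (CGT) followed by TGA fits.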

      Here's another way to look at things: instrument the regex with  (?{ code }) (see Extended Patterns) print points to learn by experimentation. I'm also taking the liberty of introducing some other new constructs: the  (?:pattern) non-capturing grouping (also see Extended Patterns); the  /x regex modifier (all the preceding links found in perlre); and the  @- (aka @LAST_MATCH_START) array regex special variable (see perlvar).

      First look at TGA matching against a simplified string without a  \G anchor. Note that in contrast to some other code examples in this thread, the beginning offset of a match is reported.

      c:\@Work\Perl>perl -wMstrict -le
      "my $s = 'XXXxxxTGAxxTGAxxxxxxx';
       while ($s =~ m{ (?{ print qq{trying a match at offset }, pos $s }) (?: \w\w\w)*? (TGA) }xmsg) {
           print qq{matched TGA beginning at offset $-[1]};
       }
      "
      trying a match at offset 0
      matched TGA beginning at offset 6
      trying a match at offset 9
      trying a match at offset 10
      trying a match at offset 11
      matched TGA beginning at offset 11
      After the successful TGA match at offsets 6 thru 8, the regex engine starts trying to match again at offset 9. The RE tries matches at offsets 9, 10 and 11 and finds a spurious (because it's not on a base-triplet boundary) match at offset 11-13. (I'm not sure why the RE doesn't try matching from offset 14 onward.)

      Now consider the effect of adding a  \G anchor assertion.

      c:\@Work\Perl>perl -wMstrict -le
      "my $s = 'XXXxxxTGAxxTGAxxxxxxx';
       while ($s =~ m{ \G (?{ print qq{trying a match at offset }, pos $s }) (?: \w\w\w)*? (TGA) }xmsg) {
           print qq{matched TGA beginning at offset $-[1]};
       }
      "
      trying a match at offset 0
      matched TGA beginning at offset 6
      trying a match at offset 9
      Now the RE can only begin another successful match at the offset immediately beyond the point at which the previous successful match ended, offset 9; it cannot try offsets 10 or 11 or any other because they do not satisfy the  \G assertion.
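The same contrast can be collected into lists rather than printed. A minimal sketch, assuming the DNA string from the perlretut example under discussion is 'ATCGTTGAATGCAAATGACATGAC'; `stop_codon_positions` is a hypothetical helper, not something from perlretut or this thread.

```perl
use strict;
use warnings;

# Collect pos() after every successful /g match of $regex against $string.
sub stop_codon_positions {
    my ($string, $regex) = @_;
    my @positions;
    while ($string =~ /$regex/g) {
        push @positions, pos $string;
    }
    return @positions;
}

my $dna = 'ATCGTTGAATGCAAATGACATGAC';   # assumed to be the perlretut string

my @naive    = stop_codon_positions($dna, qr/(\w\w\w)*?TGA/);
my @anchored = stop_codon_positions($dna, qr/\G(\w\w\w)*?TGA/);

print "naive:    @naive\n";      # naive:    18 23
print "anchored: @anchored\n";   # anchored: 18
```

Position 23 comes from a TGA at offset 20, which is reachable only because the unanchored regex is free to restart at offsets 19 and 20 after failing at 18. With \G the second attempt must begin at offset 18, fails there, and the loop ends.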

      Supplemental: We just got finished saying that in

      my $s = 'XXXxxxTGAxxTGAxxxxxxx';
      while ($s =~ m{ (?{ print qq{trying a match at offset }, pos $s }) (?: \w\w\w)*? (TGA) }xmsg) {
          print qq{matched TGA beginning at offset $-[1]};
      }
      the RE will match the TGA at offset 11 because it's not constrained by a  \G assertion. So in
      c:\@Work\Perl>perl -wMstrict -le
      "my $s = 'XXXxxxTGAxxTGAxxxxTGAxx';
       while ($s =~ m{ (?{ print qq{trying a match at offset }, pos $s }) (?: \w\w\w)*? (TGA) }xmsg) {
           print qq{matched TGA beginning at offset $-[1]};
       }
      "
      trying a match at offset 0
      matched TGA beginning at offset 6
      trying a match at offset 9
      matched TGA beginning at offset 18
      (still no \G), why does the RE miss the TGA at offset 11 when there is another TGA present at offset 18 (which it does match)?
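One way to approach the Supplemental question is to report where the whole match starts ($-[0]) alongside where the captured TGA starts ($-[1]). The sketch below drops the (?{ code }) print point and only records offsets; the `@report` array is an illustrative name.

```perl
use strict;
use warnings;

my $s = 'XXXxxxTGAxxTGAxxxxTGAxx';
my @report;
while ($s =~ m{ (?: \w\w\w)*? (TGA) }xmsg) {
    # $-[0] is the start of the whole match, $-[1] the start of group 1.
    push @report, "match starts at $-[0], TGA at $-[1]";
}
print "$_\n" for @report;
# match starts at 0, TGA at 6
# match starts at 9, TGA at 18
```

The second match still starts at offset 9: three triplets (offsets 9 thru 17) line up with the TGA at offset 18, so the attempt at offset 9 succeeds and the engine never needs to bump along to offsets 10 or 11. In the shorter string no number of triplets from offset 9 reaches a TGA, which is why the start had to advance there.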


      Give a man a fish:  <%-{-{-{-<

        Hello AnomalousMonk, and thanks for the writeup.

        Is your Supplemental question meant rhetorically? Because I, for one, would really like to know the answer! Consider the following:

        19:40 >perl -wE "my $s = 'abCdefC'; while ($s =~ / (f)*? C /gx) { say qq[match: $1, pos = ], pos $s; }"
        Use of uninitialized value $1 in concatenation (.) or string at -e line 1.
        match: , pos = 3
        match: f, pos = 7
        19:41 >

        In this case the regex engine does not say: “Well, matching 1 f is better than matching none, so I’ll match the fC sequence first.” On the contrary, it first matches the C preceded by zero fs, as one might reasonably expect from the quantifier, *?, which says, “match this zero or more times in a non-greedy way.” I actually don’t understand how this can ever, logically, match with more than zero, since zero is possible and less greedy than 1??

        But back to the OP, how does $dna =~ / (\w\w\w)*? TGA /gx differ logically from $s =~ / (f)*? C /gx? I can’t see a fundamental distinction here, yet there must be one because the former finds a non-empty match for (\w\w\w) before it looks for an empty one!
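A small experiment may help here. It suggests the two regexes follow the same rule: the earliest starting offset that can match at all wins, and *? only picks the fewest repetitions that work from that start. The strings and the `@starts` array are illustrative.

```perl
use strict;
use warnings;

# From start offset 0, zero f's and one f both fail to reach a C,
# so *? settles for two, even though zero would work from offset 2.
'ffC' =~ /(f)*?C/ or die 'no match';
print "matched '$&' at offset $-[0]\n";    # matched 'ffC' at offset 0

# In 'abCdefC' the earliest start that can match is offset 2 (zero f's),
# and after that the earliest is offset 5 (one f).
my $s = 'abCdefC';
my @starts;
while ($s =~ /(f)*?C/g) {
    push @starts, $-[0];
}
print "match starts: @starts\n";           # match starts: 2 5
```

On this reading, the DNA regex finds a non-empty (\w\w\w) first only because an empty one cannot match at the leftmost starting offset, not because more repetitions are preferred.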

        Thanks,

        Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        I really appreciate the effort you put into my question. Thank you very much, sir.
        Fascinated by your post, I cast a stone in the lake (of my ignorance)..
        At offset 11, is it not too late to have 3 chars before TGA?

        L*
        There are no rules, there are no thumbs..
        Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
