in reply to Re^4: Understanding a portion of perlretut
in thread Understanding a portion on the Perlretut
Re^6: Understanding a portion of perlretut
by AnomalousMonk (Archbishop) on Dec 09, 2015 at 22:04 UTC
Here's another way to look at things: instrument the regex with (?{ code }) print points (see Extended Patterns) to learn by experimentation. I'm also taking the liberty of introducing some other new constructs: the (?:pattern) non-capturing grouping (also see Extended Patterns); the /x regex modifier (all the preceding links found in perlre); and the @- (aka @LAST_MATCH_START) array regex special variable (see perlvar).

First look at TGA matching against a simplified string without a \G anchor. Note that in contrast to some other code examples in this thread, the beginning offset of a match is reported.

After the successful TGA match at offsets 6 thru 8, the regex engine starts trying to match again at offset 9. The RE tries matches at offsets 9, 10 and 11 and finds a spurious (because it's not on a base-triplet boundary) match at offsets 11 thru 13. (I'm not sure why the RE doesn't try matching from offset 14 onward.)

Now consider the effect of adding a \G anchor assertion. Now the RE can only begin another successful match at the offset immediately beyond the point at which the previous successful match ended, offset 9; it cannot try offsets 10 or 11 or any other because they do not satisfy the \G assertion.

Supplemental: We just got finished saying that, in the first example, the RE will match the TGA at offset 11 because it's not constrained by a \G assertion. So, still with no \G, why does the RE miss the TGA at offset 11 when there is another TGA present at offset 18 (which it does match)?

Give a man a fish: <%-{-{-{-<
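The \G comparison described in the post above can be tried with a small sketch like the one below. The test string is hypothetical (chosen only so that TGA sits at offsets 6 and 11), the variable names are illustrative, and this is not the code from the original post. The (?{ ... }) block records each position at which the engine begins an attempt; what it records can depend on the perl version and on regex optimizations, so no expected list is given for it.

```perl
use strict;
use warnings;

# Hypothetical test string: TGA at offset 6 (on a triplet boundary)
# and at offset 11 (off a triplet boundary).
my $dna = 'xxxxxxTGAxxTGAxxx';

# No \G: after a match the engine may restart at any later offset.
# The (?{ }) block logs pos() each time an attempt begins.
my @free;
while ($dna =~ / (?{ push @main::tried, pos() }) (?:\w\w\w)*? TGA /gx) {
    push @free, $+[0] - 3;      # $+[0] is the match end; TGA starts 3 earlier
}

# With \G: each match must begin exactly where the previous one ended.
pos($dna) = 0;                  # the failed match above already reset pos
my @anchored;
while ($dna =~ / \G (?:\w\w\w)*? TGA /gx) {
    push @anchored, $+[0] - 3;
}

print "attempt positions logged (no \\G): @main::tried\n";
print "TGA offsets without \\G: @free\n";       # 6 11
print "TGA offsets with    \\G: @anchored\n";   # 6
```

Without \G the loop reports the off-boundary TGA at offset 11; with \G the second attempt is pinned to offset 9, fails there, and the loop ends.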
by Athanasius (Archbishop) on Dec 10, 2015 at 09:54 UTC
Hello AnomalousMonk, and thanks for the writeup. Is your Supplemental question meant rhetorically? Because I, for one, would really like to know the answer! Consider the following:
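The snippet that belongs here is not preserved. A minimal sketch, rebuilt from the loop header quoted further down the thread (`my $s = 'abCdefC'; while ($s =~ / (f)*? C /gx) { ... }`), might look like this; the loop body and the report format are assumptions, not the original code.

```perl
use strict;
use warnings;

my $s = 'abCdefC';

my @report;
while ($s =~ / (f)*? C /gx) {
    # Text of the whole match, taken from the @- and @+ offsets.
    my $text = substr($s, $-[0], $+[0] - $-[0]);
    my $cap  = defined $1 ? "'$1'" : 'undef';
    push @report, sprintf("matched '%s' at offset %d, \$1 = %s",
                          $text, $-[0], $cap);
}

print "$_\n" for @report;
# matched 'C' at offset 2, $1 = undef
# matched 'fC' at offset 5, $1 = 'f'
```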
In this case the regex engine does not say: “Well, matching 1 f is better than matching none, so I’ll match the fC sequence first.” On the contrary, it first matches the C preceded by zero fs, as one might reasonably expect from the quantifier, *?, which says, “match this zero or more times in a non-greedy way.” I actually don’t understand how this can ever, logically, match with more than zero, since zero is possible and less greedy than 1?

But back to the OP, how does $dna =~ / (\w\w\w)*? TGA /gx differ logically from $s =~ / (f)*? C /gx? I can’t see a fundamental distinction here, yet there must be one because the former finds a non-empty match for (\w\w\w) before it looks for an empty one!

Thanks,
by choroba (Cardinal) on Dec 10, 2015 at 10:21 UTC
"how does $dna =~ / (\w\w\w)*? TGA /gx differ logically from $s =~ / (f)*? C /gx"

The important difference here is the length of $1. After the first match (A is where the matching started, B denotes the position of the capture group)
the matching starts at B + 1. Zero repetitions of \w\w\w don't match here (we have xxTGAx), so the engine tries longer and longer strings, until it finds the TGA:
The next search will start at B + 1 again, and fail on xx. But, with the capture group of length 1, you always match the nearest group, because the (f)*? tries longer and longer strings. Maybe what's confusing here is that expanding the group by one character is similar to the engine advancing the starting position after a match failure?
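The "longer and longer strings" behaviour can be made visible with a sketch such as the following. The string is hypothetical (TGA at offsets 6, 11 and 18, as in the Supplemental question), and the capture group is widened to ((?:\w\w\w)*?) purely so that the number of triplets consumed can be measured; neither is taken from the original posts.

```perl
use strict;
use warnings;

# Hypothetical string: TGA at offsets 6, 11 and 18.
my $dna = 'xxxxxxTGAxxTGAxxxxTGAxx';

my @spans;
while ($dna =~ / ((?:\w\w\w)*?) TGA /gx) {
    # $-[0]: where this match began; length($1)/3: triplets the lazy
    # group had to consume before TGA could match; $+[0]-3: TGA offset.
    push @spans, sprintf('start=%d triplets=%d tga=%d',
                         $-[0], length($1) / 3, $+[0] - 3);
}

print "$_\n" for @spans;
# start=0 triplets=2 tga=6
# start=9 triplets=3 tga=18
```

The second match begins at offset 9 and grows by one triplet at a time (9-11, 12-14, 15-17), so the TGA at offset 11 is swallowed inside the first triplet instead of being matched as TGA. Because a match starting at offset 9 succeeds, the engine never gets to try starting at offset 11.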
by Athanasius (Archbishop) on Dec 10, 2015 at 12:46 UTC
by choroba (Cardinal) on Dec 10, 2015 at 13:02 UTC
by AnomalousMonk (Archbishop) on Dec 10, 2015 at 22:22 UTC
"Is your Supplemental question meant rhetorically?"

It was meant rhetorically, but I'm glad you enjoyed it!

... my $s = 'abCdefC'; while ($s =~ / (f)*? C /gx) { ... }

I think choroba has already well addressed the issues you raised in the paragraph following the one from which this is quoted, but let me try to address this one specifically, insofar as I understand what's going on and assuming I understand your question!

In the code below, I think we're both happy that the (f)*? capture group acting before the first 'C' in the string is allowed not to match at all, and in that case the value of the capture variable ($1 in the code) is undef. I think we can agree that if the group expression were changed to (f*?) it would also match, capturing the empty string to $1.

The second 'C' in the string is preceded by an 'f'. Why do both (f)*? and (f*?) capture the 'f' when they can be satisfied with nothing and need not be satisfied with anything more than nothing (i.e., they both do lazy matching)? Here's my story. If the RE matches nothing at offset 5 (the 'f'), it must then match a 'C' at offset 5, which is already occupied by an 'f', in order to satisfy the overall regex! The RE must first "consume" the 'f' at offset 5 before it can advance to match the 'C' at offset 6 for an overall match.
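The difference between the two group forms can be checked with a short sketch. The helper name `captures` is hypothetical, and this is a stand-in for illustration, not the code the post refers to.

```perl
use strict;
use warnings;

my $s = 'abCdefC';

# Hypothetical helper: run a /g loop and describe $1 for every match.
sub captures {
    my ($str, $re) = @_;
    my @seen;
    while ($str =~ /$re/g) {
        push @seen, defined $1 ? "'$1'" : 'undef';
    }
    return @seen;
}

my @outer = captures($s, qr/ (f)*? C /x);   # quantifier outside the group
my @inner = captures($s, qr/ (f*?) C /x);   # quantifier inside the group

print "(f)*? captures: @outer\n";   # undef 'f'
print "(f*?) captures: @inner\n";   # '' 'f'
```

With the quantifier outside the group, zero iterations leave $1 undefined; with it inside, the group itself takes part in the match and captures the empty string. In both forms the 'f' at offset 5 has to be consumed before the 'C' at offset 6 can match.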
But here's a non-rhetorical question. In the code below, notice that there is a peculiar double-step at pos 3. The 'f' at offset 2 is first not captured (either as undef or as the empty string), then captured. I don't get it: a non-zero-width match is never a necessity for an overall match. Why not just step over the 'f' at offset 2 in the same way all the other characters are stepped over?

(This code produces the same output under Strawberries 5.10, 5.12 and 5.14. However, when run under ActiveState 5.8.9, the output is the same except that $1 is always undef! I assume this is a bug that was fixed between 5.8.9 and 5.10.x, or maybe between AS and Strawberry.)

Update: Consider also the second code example with a string of 'xxfffxx' for similar perplexity.

Give a man a fish: <%-{-{-{-<
by choroba (Cardinal) on Dec 10, 2015 at 22:51 UTC
by Anonymous Monk on Dec 09, 2015 at 22:39 UTC
by Discipulus (Canon) on Dec 10, 2015 at 11:12 UTC
At offset 11, isn't it too late to have 3 chars before TGA? L*
There are no rules, there are no thumbs.. Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.