Re^8: Understanding a portion of perlretut

how does $dna =~ / (\w\w\w)*? TGA /gx differ logically from $s =~ / (f)*? C /gx

The important difference here is the length of $1.

After the first match (A is where the matching started, B denotes the position of the capture group)

XXXxxxTGAxxTGAxxxxTGAxx
^     ^
|     |
A     B

the matching resumes at B + 3, right after the matched TGA. Zero repetitions of \w\w\w don't match there, because the text continues with xxTGAx, so the engine tries more and more repetitions until a TGA follows directly:

XXXxxxTGAxxTGAxxxxTGAxx
         ^        ^
         |        |
         A        B

The next search will start right after this TGA, at B + 3, and fail on the remaining xx.

But with a capture group of length 1 you always match the nearest occurrence, because (f)*? grows by a single character at a time. Maybe what's confusing here is that extending the group by one character looks much like the engine advancing the starting position after a failed match?
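The difference can be checked with a small sketch. It runs the triplet pattern and a one-character variant over the sample string from the diagrams and records where each match starts and ends. The helper name match_spans and the (\w)*? variant are illustrative assumptions, not taken from perlretut.

```perl
use strict;
use warnings;

my $dna = 'XXXxxxTGAxxTGAxxxxTGAxx';

# Return [start, end] offsets for every /g match of $re in $str.
# (match_spans is a hypothetical helper written for this illustration.)
sub match_spans {
    my ($str, $re) = @_;
    my @spans;
    while ($str =~ /$re/g) {
        push @spans, [ $-[0], pos($str) ];
    }
    return @spans;
}

# Group of length 3: a TGA is only seen if it sits a whole number
# of triplets away from where the match attempt started.
my @triplets = match_spans($dna, qr/(\w\w\w)*?TGA/);

# Group of length 1: every TGA can be reached.
my @singles = match_spans($dna, qr/(\w)*?TGA/);

print "triplets: ", join(' ', map { "$_->[0]-$_->[1]" } @triplets), "\n";
print "singles:  ", join(' ', map { "$_->[0]-$_->[1]" } @singles),  "\n";
```

With the triplet group the TGA at offset 11 is skipped, because it is two characters, not a multiple of three, past the point where the second attempt begins.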

($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord
}map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,


Re^9: Understanding a portion of perlretut
by Athanasius (Archbishop) on Dec 10, 2015 at 12:46 UTC

Hello choroba,

Thanks for the explanation, and I’m sorry to be obtuse but — I still don’t understand. :-( Consider:

#! perl -l
use strict;
use warnings;

my $s = 'uvXYZdabcXYZfg';

while ($s =~ /(\w\w\w)*?(XYZ)/g)
{
    print 'Found match ', $1, $2, ' at pos: ', pos $s;
}

print '-----';

while ($s =~ /(abc)*?(XYZ)/g)
{
    print 'Found match ', $1, $2, ' at pos: ', pos $s;
}

Output:

22:35 >perl 1476_SoPW.pl
Found match abcXYZ at pos: 12
-----
Use of uninitialized value $1 in print at 1476_SoPW.pl line 28.
Found match XYZ at pos: 5
Found match abcXYZ at pos: 12

22:35 >

The first capture in each regex is 3 characters wide, but the first regex matches only the second occurrence of XYZ whereas the second regex also matches the first occurrence, with (abc)*? matching zero times. Why the difference in behaviour? In particular, why does (\w\w\w)*? not also match zero times?

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re^10: Understanding a portion of perlretut

by choroba (Cardinal) on Dec 10, 2015 at 13:02 UTC

In the first case, the engine starts from the left and finds a match, repeating (\w\w\w) 3 times:

uvXYZdabcXYZfg
^        ^
|        |
A        B

The engine then resumes matching at B + 3, right after XYZ, and finds no further match.

In the second case, the engine starts from the left as well, but finds no match:

uvXYZdabcXYZfg
^
|
A

So, it moves to A + 1 (still no match), and then A + 2, where it can match with (abc)* repeating zero times:

uvXYZdabcXYZfg
  ^
  |
 A=B

After matching, it continues (because of /g) to B + 3 (no match), and at B + 4 it finally succeeds with

uvXYZdabcXYZfg
      ^  ^
      |  |
      A  B

Better now?
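The A and B positions in the diagrams can be read off directly from the @- array, as a rough check. In this sketch, A is taken to be $-[0] (start of the whole match) and B to be $-[2] (start of the XYZ group). The helper name trace is made up for the illustration.

```perl
use strict;
use warnings;

my $s = 'uvXYZdabcXYZfg';

# For each /g match, report A (start of the match), B (start of XYZ)
# and the position where the next attempt will begin.
# (trace is a hypothetical helper written for this illustration.)
sub trace {
    my ($str, $re) = @_;
    my @rows;
    while ($str =~ /$re/g) {
        push @rows, sprintf 'A=%d B=%d pos=%d', $-[0], $-[2], pos($str);
    }
    return @rows;
}

my @first  = trace($s, qr/(\w\w\w)*?(XYZ)/);
my @second = trace($s, qr/(abc)*?(XYZ)/);

print "first:  $_\n" for @first;
print "second: $_\n" for @second;
```

The first pattern reports a single match starting at offset 0, while the second reports one match starting at offset 2 (A equal to B, zero repetitions of abc) and another starting at offset 6.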



Re^11: Understanding a portion of perlretut

by Athanasius (Archbishop) on Dec 10, 2015 at 13:27 UTC

Yes, thanks, I’m starting to see some light. :-) But:

In the first case, the engine starts from the left and finds a match, repeating (\w\w\w) 3 times:

Makes sense, but then, if the $1 match effectively ends up as (\w\w\w){3}, shouldn’t $1 contain uvXYZdabc? Or, conversely, if the ? quantifier following the * causes it to find the minimum string in uvXYZdabc matching (\w\w\w)*, shouldn’t that be "", the null string? Why does $1 match abc here?
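One experiment that bears on this question: wrapping the repeated group in an extra pair of parentheses shows both the whole repeated span and what the inner group holds afterwards. The extra outer group is an addition made for this sketch only. It relies on the fact that a capture group inside a quantifier keeps just the text matched by its final iteration.

```perl
use strict;
use warnings;

my $s = 'uvXYZdabcXYZfg';

my ($whole, $last, $tail);

# $1 = everything the repeated group consumed,
# $2 = what the inner group matched on its final iteration,
# $3 = the XYZ that follows.
if ($s =~ /((\w\w\w)*?)(XYZ)/) {
    ($whole, $last, $tail) = ($1, $2, $3);
    print "whole repeated span: $whole\n";
    print "last iteration:      $last\n";
    print "followed by:         $tail\n";
}
```

Here the outer group holds uvXYZdabc while the inner group holds only abc, which matches the abc seen in $1 in the original script.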



Re^12: Understanding a portion of perlretut

by Corion (Patriarch) on Dec 10, 2015 at 13:30 UTC

