in reply to Re^8: Understanding a portion of perlretut
in thread Understanding a portion on the Perlretut

Hello choroba,

Thanks for the explanation, and I’m sorry to be obtuse but — I still don’t understand. :-( Consider:

#! perl -l
use strict;
use warnings;

my $s = 'uvXYZdabcXYZfg';

while ($s =~ /(\w\w\w)*?(XYZ)/g)
{
    print 'Found match ', $1, $2, ' at pos: ', pos $s;
}

print '-----';

while ($s =~ /(abc)*?(XYZ)/g)
{
    print 'Found match ', $1, $2, ' at pos: ', pos $s;
}

Output:

22:35 >perl 1476_SoPW.pl
Found match abcXYZ at pos: 12
-----
Use of uninitialized value $1 in print at 1476_SoPW.pl line 28.
Found match XYZ at pos: 5
Found match abcXYZ at pos: 12

22:35 >

The first capture in each regex is 3 characters wide, but the first regex matches only the second occurrence of XYZ whereas the second regex also matches the first occurrence, with (abc)*? matching zero times. Why the difference in behaviour? In particular, why does (\w\w\w)*? not also match zero times?

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

Re^10: Understanding a portion of perlretut
by choroba (Cardinal) on Dec 10, 2015 at 13:02 UTC
    In the first case, the engine starts from the left and finds a match, repeating (\w\w\w) 3 times:
    uvXYZdabcXYZfg
    ^          ^
    |          |
    A          B

    The engine then tries to match again starting at B + 1, and finds no further match.
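    The span of that single match can be checked with Perl's built-in @- and @+ offset arrays. The following is only a minimal sketch; the variable @spans and the hash keys are made up for illustration:

```perl
use strict;
use warnings;

my $s = 'uvXYZdabcXYZfg';
my @spans;    # hypothetical name: one record per successful match

while ($s =~ /(\w\w\w)*?(XYZ)/g) {
    push @spans, {
        start    => $-[0],    # offset where the whole match begins
        end      => $+[0],    # offset just past the whole match
        g1       => $1,       # text of the LAST iteration of group 1
        g1_start => $-[1],    # offsets of that last iteration
        g1_end   => $+[1],
    };
}

for my $m (@spans) {
    printf "match %d..%d, group 1 = '%s' at %d..%d\n",
        $m->{start}, $m->{end}, $m->{g1}, $m->{g1_start}, $m->{g1_end};
}
# prints: match 0..12, group 1 = 'abc' at 6..9
```

    The match starts at offset 0 and ends at offset 12, so the next attempt begins at the "f", where no XYZ remains.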

    In the second case, the engine starts from the left as well, but finds no match:

    uvXYZdabcXYZfg
    ^
    |
    A

    So, it moves to A + 1 (still no match), and then A + 2, where it can match with (abc)* repeating zero times:

    uvXYZdabcXYZfg
      ^
      |
      A=B

    After matching, it continues (because of /g) to B + 3 (no match), and at B + 4 it finally succeeds, this time with (abc) matching once:

    uvXYZdabcXYZfg
          ^    ^
          |    |
          A    B
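
    To see where each match in the second case starts, and whether group 1 took part at all, something along these lines should work. It is a rough sketch; @matches is just an illustrative name:

```perl
use strict;
use warnings;

my $s = 'uvXYZdabcXYZfg';
my @matches;    # hypothetical name: [start offset, pos after match, $1 or undef]

while ($s =~ /(abc)*?(XYZ)/g) {
    # $1 stays undef when (abc) repeated zero times, so test it before use
    push @matches, [ $-[0], pos($s), $1 ];
}

for my $m (@matches) {
    my ($start, $pos, $g1) = @$m;
    printf "start %d, pos %d, group 1: %s\n",
        $start, $pos, defined $g1 ? "'$g1'" : 'did not participate';
}
# prints:
# start 2, pos 5, group 1: did not participate
# start 6, pos 12, group 1: 'abc'
```

    Testing $1 with defined before printing it also avoids the "uninitialized value" warning from the original script.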

    Better now?

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Yes, thanks, I’m starting to see some light. :-) But:

      In the first case, the engine starts from the left and finds a match, repeating (\w\w\w) 3 times:

      Makes sense, but then, if the $1 match effectively ends up as (\w\w\w){3}, shouldn’t $1 contain uvXYZdabc? Or, conversely, if the ? quantifier following the * causes it to find the minimum string in uvXYZdabc matching (\w\w\w)*, shouldn’t that be "", the null string? Why does $1 match abc here?

      Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,

        The repeated capturing overwrites the value in $1. If you want the repetition inside of a capture variable, you'll have to use something like ((?:\w\w\w)*?).
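
        A small comparison of the two forms, as a sketch only (the variable names $last_iteration and $whole_run are invented here):

```perl
use strict;
use warnings;

my $s = 'uvXYZdabcXYZfg';

# Quantifier OUTSIDE the capture group: each iteration overwrites $1,
# so only the text of the final iteration survives.
my ($last_iteration) = $s =~ /(\w\w\w)*?XYZ/;

# Quantifier INSIDE the capture group (around a non-capturing group):
# the capture holds everything the repetition consumed.
my ($whole_run) = $s =~ /((?:\w\w\w)*?)XYZ/;

print "last iteration: $last_iteration\n";    # last iteration: abc
print "whole run:      $whole_run\n";         # whole run:      uvXYZdabc
```

        In both patterns the *? is still lazy: it tries zero, one, two and then three repetitions, and stops at three only because that is the first count that lets XYZ match from offset 0.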