pat_mc has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Perl Monks,

I am trying to control the matching position in regex matches. Ideally, I want all substring matches for a given match string:

$ perl -e 'while('aaa' =~/(aa)/g){print $-[-1], "\n"}'
0
Why am I only getting one match instead of two? I was expecting two matches, the first one in position 0 and the second one in position 1. Clearly, the regex engine continues matching at the end of the first match. Is there an elegant way (e.g., a regex modifier) to change the matching behaviour? Or do I really need to resort to the manual manipulation of the regex-internal matching position record?

Thanks in advance for taking the time to respond.

Cheers -

Pat

Replies are listed 'Best First'.
Re: Controlling matching position in regexes to match all proper substrings
by roboticus (Chancellor) on Oct 04, 2014 at 14:01 UTC

    pat_mc:

    You might try using a zero-width positive lookahead assertion. I've not used them much, so there may be some gotchas in their general use. But for a simple usage like yours, it's not so bad:

    $ cat u.pl
    use strict;
    use warnings;

    while ('aaa' =~/a(?=a)/g) {
        print $-[-1], "\n"
    }

    $ perl u.pl
    0
    1

    Read about them in perldoc perlre in the "Look-Around Assertions" section.
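
    The lookahead idea can also report the matched text together with its offset. The following is only a sketch, not code from the thread: the helper name overlapping_matches is made up for illustration, and it assumes a pattern that matches a fixed-length, non-empty string (such as aa or \w\w).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns [substring, offset]
# pairs for every match of $re in $s, including overlapping ones.
sub overlapping_matches {
    my ($s, $re) = @_;
    my @hits;
    # The lookahead consumes no characters, so every match is zero-width;
    # after a zero-width match the engine retries one position further on.
    while ($s =~ /(?=($re))/g) {
        push @hits, [ $1, $-[1] ];   # $-[1] is the start offset of group 1
    }
    return @hits;
}

print "captured '$_->[0]' at offset $_->[1]\n"
    for overlapping_matches('aaa', qr/aa/);
```

    For 'aaa' this reports 'aa' at offsets 0 and 1. With a variable-length pattern such as \w+, the lookahead yields only one match per start position, so it does not enumerate every substring.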

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: Controlling matching position in regexes to match all proper substrings
by AnomalousMonk (Archbishop) on Oct 04, 2014 at 16:20 UTC

    A couple other ways:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "use 5.010;
     ;;
     print 'Perl version ', $];
     ;;
     my $s = 'abcd';
     ;;
     my @captures = $s =~ m{ (?= (\w\w)) }xmsg;
     printf qq{'$_' } for @captures;
     print '';
     ;;
     local our @caps;
     ()= $s =~ m{ (?: (\w\w) (?{ push @caps, [ $^N, $-[1] ] }) (*F)) }xmsg;
     print qq{captured '$_->[0]' at offset $_->[1]} for @caps;
    "
    Perl version 5.014004
    'ab' 'bc' 'cd'
    captured 'ab' at offset 0
    captured 'bc' at offset 1
    captured 'cd' at offset 2

    Note that Perl version 5.10+ is needed only for the (*F) construct (see Special Backtracking Control Verbs) in the second example above. Also, the (?{ code }) block in the second example must push to a package-global array because this construct doesn't work reliably with pad (my) variables (this was fixed in version 5.18 – I think). (Update: The first example above is very nice and neat if you only need to capture the (overlapping) matched substrings. If more info is needed, e.g., substring offset, one of the other, messier approaches must be used.)

    Update: Another version of the second example can be had that does not depend on any 5.10 regex enhancement (the example below was run under 5.8.9), but it has the quirk that (for a reason I used to know but have since forgotten) all push-es to the array are duplicated! If you can live with that, then:

    c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
    "print 'Perl version ', $];
     ;;
     my $s = 'abcd';
     ;;
     local our @caps;
     ()= $s =~ m{ (?= (\w\w) (?{ push @caps, [ $^N, $-[1] ] })) }xmsg;
     dd \@caps;
     ;;
     my $skip;
     print qq{captured '$_->[0]' at offset $_->[1]} for grep $skip = !$skip, @caps;
    "
    Perl version 5.008009
    [["ab", 0], ["ab", 0], ["bc", 1], ["bc", 1], ["cd", 2], ["cd", 2]]
    captured 'ab' at offset 0
    captured 'bc' at offset 1
    captured 'cd' at offset 2

    Update 2: Actually, in the regex of the
        ()= $s =~ m{ (?: (\w\w) (?{ push @caps, [ $^N, $-[1] ] }) (*F)) }xmsg;
    statement of the second example above, the outermost (?: ... ) non-capturing grouping is completely useless. The
         m{ (\w\w) (?{ push @caps, [ $^N, $-[1] ] }) (*F) }xmsg
    regex works just as well.

Re: Controlling matching position in regexes to match all proper substrings
by AnomalousMonk (Archbishop) on Oct 04, 2014 at 17:16 UTC
    Why am I only getting one match instead of two? ... Clearly, the regex engine continues matching at the end of the first match.

    The regex engine does, indeed, normally resume matching at the end of the previous match. This can be neatly exemplified as below (works the same under 5.14):

    c:\@Work\Perl\monks>perl -wMstrict -le
    "print 'Perl version ', $];
     ;;
     my $s = '123456';
     ;;
     printf qq{'$_' } for $s =~ m{ (\d\d) }xmsg;
     print ''
     ;;
     printf qq{'$_' } for $s =~ m{ (?= (\d\d)) }xmsg;
    "
    Perl version 5.008009
    '12' '34' '56'
    '12' '23' '34' '45' '56'
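
    If only the number of matches matters, the same contrast can be shown by counting. This is a rough sketch rather than anything posted in the thread: the sub names count_disjoint and count_overlapping are invented, and it assumes the pattern passed in contains no capture groups of its own (captures would inflate the counts).

```perl
use strict;
use warnings;

# Hypothetical helpers, named for illustration only.

# Counts matches the normal way: each match starts where the last one ended.
sub count_disjoint {
    my ($s, $re) = @_;
    my $n = () = $s =~ /$re/g;       # list assignment in scalar context counts
    return $n;
}

# Counts every start position at which the pattern can match.
sub count_overlapping {
    my ($s, $re) = @_;
    my $n = () = $s =~ /(?=$re)/g;   # zero-width, so nothing is consumed
    return $n;
}

printf "%d disjoint, %d overlapping\n",
    count_disjoint('123456', qr/\d\d/),
    count_overlapping('123456', qr/\d\d/);
```

    For '123456' and \d\d this gives 3 disjoint and 5 overlapping matches, in line with the two output rows above.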

Re: Controlling matching position in regexes to match all proper substrings
by LanX (Saint) on Oct 04, 2014 at 14:00 UTC
    There is an anchor symbol to define where the search has to continue.

    IIRC it's \G, so /(a\Ga)/g should do.

    Have a look at perlre and perlretut.

    update

    Otherwise pos can be used in a while loop to read and set the next search start position.

    Just subtract the length of your match from pos and add 1 to continue.
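
    A minimal sketch of that pos approach, wrapped in a sub. The name overlapping_offsets is hypothetical, and the code assumes the pattern never matches the empty string (otherwise the loop could fail to terminate). It uses $-[0] + 1, which is the same position as pos minus the match length plus 1.

```perl
use strict;
use warnings;

# Hypothetical helper: collects the start offset of every (possibly
# overlapping) match by rewinding pos() after each successful match.
# Assumption: $re never matches the empty string.
sub overlapping_offsets {
    my ($s, $re) = @_;
    my @starts;
    while ($s =~ /$re/g) {
        push @starts, $-[0];     # $-[0] is where the whole match began
        pos($s) = $-[0] + 1;     # resume one character past that start
    }
    return @starts;
}

print join(' ', overlapping_offsets('12aaa67aaaa23', qr/aa/)), "\n";
```

    For '12aaa67aaaa23' and aa this prints the start offsets 2 3 7 8 9.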

    Cheers Rolf

    (addicted to the Perl Programming Language and ☆☆☆☆ :)

      I'm now at a box which runs Perl and can test:

      \G doesn't help, forgot about the limitations!¹

      But the pos approach works:

      DB<129> $_='12aaa67aaaa23'
       => "12aaa67aaaa23"

      DB<130> print pos($_) while ($_=~/(a\Ga)/g)

      DB<131> print pos($_),"\n" and pos($_)=pos($_)-1 while ($_=~/(aa)/g)
      4
      5
      9
      10
      11

      HTH! :)

      Cheers Rolf


      ¹) perlretut : the \G anchor is only fully supported when used to anchor to the start of the pattern.