I guess in its simplest form, my question is whether the state that the regex engine uses to determine whether $ was already matched is available to me. For example:

    my $str = "abc";
    $str =~ m/.|$/g; # succeeds, and pos goes to 1
    $str =~ m/.|$/g; # succeeds, and pos goes to 2
    $str =~ m/.|$/g; # succeeds, and pos goes to 3
    $str =~ m/.|$/g; # succeeds, and pos stays at 3
    # Why does this behave differently than the next regex? pos($str) is 3 for both calls
    $str =~ m/.|$/g; # fails and resets pos

How does the engine know that it has already succeeded with the fourth regex (when pos is already 3)? There must be some mechanism that records that internally in the regex engine. I'd like to know if that information is accessible.

FYI, I rewrite my code constantly, and this issue came up during one rewrite. I have to go back and see whether there is a case where it would be helpful. Still, the question piqued my interest: I would like to understand whether the internal structure that holds this information (that $ has matched) is accessible.

Re^5: perl indication of end of string already matched
by haukex (Archbishop) on Jun 08, 2020 at 22:37 UTC
    How does the engine know that it has already succeeded with the fourth regex (when pos is already 3)? There must be some mechanism that records that internally in the regex engine.

    It might be informative to look at re debugging (although the particular state you're asking about isn't displayed, unfortunately):

        use re 'debug';
        my $str = "abc";
        $str =~ m/.|$/g for 1..5;
        __END__
        Compiling REx ".|$"
        Final program:
           1: BRANCH (3)
           2:   REG_ANY (5)
           3: BRANCH (FAIL)
           4:   SEOL (5)
           5: END (0)
        minlen 0
        Matching REx ".|$" against "abc"
           0 <> <abc>    |   0| 1:BRANCH(3)
           0 <> <abc>    |   1|  2:REG_ANY(5)
           1 <a> <bc>    |   1|  5:END(0)
        Match successful!
        Matching REx ".|$" against "bc"
           1 <a> <bc>    |   0| 1:BRANCH(3)
           1 <a> <bc>    |   1|  2:REG_ANY(5)
           2 <ab> <c>    |   1|  5:END(0)
        Match successful!
        Matching REx ".|$" against "c"
           2 <ab> <c>    |   0| 1:BRANCH(3)
           2 <ab> <c>    |   1|  2:REG_ANY(5)
           3 <abc> <>    |   1|  5:END(0)
        Match successful!
        Matching REx ".|$" against ""
           3 <abc> <>    |   0| 1:BRANCH(3)
           3 <abc> <>    |   1|  2:REG_ANY(5)
                         |   1|  failed...
           3 <abc> <>    |   0| 3:BRANCH(5)
           3 <abc> <>    |   1|  4:SEOL(5)
           3 <abc> <>    |   1|  5:END(0)
        Match successful!
        Matching REx ".|$" against ""
           3 <abc> <>    |   0| 1:BRANCH(3)
           3 <abc> <>    |   1|  2:REG_ANY(5)
                         |   1|  failed...
           3 <abc> <>    |   0| 3:BRANCH(5)
           3 <abc> <>    |   1|  4:SEOL(5)
           3 <abc> <>    |   1|  5:END(0)
        END: Match possible, but length=0 is smaller than requested=1, failing!
                         |   0| BRANCH failed...
        Match failed
        Freeing REx: ".|$"

    my question is whether the state that regex uses to determine whether $ was already matched is available to me? ... I would like to understand if the internal structure that has this information ($ matched) is accessible

    I have the same question as the others: Why? The regex engine seems to be DWIMing just fine...

    I don't know of a way to get at the internal information you're asking about. You can of course simply ask the engine to tell you what it matched, e.g.:

        my $str = "abc";
        print $str =~ m/(.)|$/g ? "matched! ".($1//'$') : 'no match', "\n" for 1..5;
        __END__
        matched! a
        matched! b
        matched! c
        matched! $
        no match

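The same idea can be taken a step further without capture groups. What follows is only a sketch, not a view into the engine's internals: after a successful match, equal offsets in $-[0] and $+[0] mean the match consumed nothing, which is the condition the engine has to remember for the next attempt. The helper name trace_matches is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: run m/.|$/g repeatedly on a copy of the string and
# record, for each attempt, whether it matched and whether it was zero-length.
sub trace_matches {
    my ($str, $attempts) = @_;
    my @log;
    for my $i (1 .. $attempts) {
        if ($str =~ m/.|$/g) {
            # @- and @+ are only meaningful after a successful match;
            # equal start and end offsets mean nothing was consumed.
            my $kind = $-[0] == $+[0] ? 'zero-length' : 'consumed';
            push @log, "attempt $i: $kind, pos=" . pos($str);
        }
        else {
            push @log, "attempt $i: failed, pos reset";
        }
    }
    return @log;
}

print "$_\n" for trace_matches("abc", 5);
```

The key design point is that the offsets are checked only inside the success branch, since a failed match leaves @- and @+ holding the values from the last successful one.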
Re^5: perl indication of end of string already matched
by AnomalousMonk (Archbishop) on Jun 09, 2020 at 00:40 UTC

    BTW: Another way to determine the start/middle/end position of a matched substring in a string is by using the @- and @+ regex special variables; see perlvar.

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = 'foo--bar--baz';
         ;;
         while ($s =~ m{ (\w+) }xmsg) {
           printf qq{matched '$1' %s of '$s' \n},
             $-[1] == 0 ? 'at start' : $+[1] == length $s ? 'at very end' : 'in middle';
           }
        "
        matched 'foo' at start of 'foo--bar--baz'
        matched 'bar' in middle of 'foo--bar--baz'
        matched 'baz' at very end of 'foo--bar--baz'

    Play with adding '-'s at the start/end of  $s to convince yourself this works.
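For anyone who would rather experiment in a script than in a Windows one-liner, here is a minimal sketch of the same classification wrapped in a sub. The name classify_words is hypothetical; the logic is the $-[1] / $+[1] comparison from the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: label each \w+ match by where it sits in the string,
# using the offsets of capture group 1 from @- and @+.
sub classify_words {
    my ($s) = @_;
    my @out;
    while ($s =~ m{ (\w+) }xmsg) {
        my $where = $-[1] == 0          ? 'at start'
                  : $+[1] == length($s) ? 'at very end'
                  :                       'in middle';
        push @out, "'$1' $where";
    }
    return @out;
}

print "$_\n" for classify_words('foo--bar--baz');
print "$_\n" for classify_words('--foo--bar--');   # both words land 'in middle'
```

Because the start test comes first, a word that spans the whole string is reported as 'at start'.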


    Give a man a fish:  <%-{-{-{-<

      Good point, although in this case those variables also don't allow one to differentiate between the last two matches:

          $ perl -MData::Dump -e '$a="abc";for(1..6){ $a=~m/.|$/g; dd \@-,\@+ }'
          ([0], [1])
          ([1], [2])
          ([2], [3])
          ([3], [3])
          ([3], [3])
          ([0], [1])

      Update: Hmm, well they can if one adds capture groups, which is kind of the same workaround as I showed here.

          $ perl -MData::Dump -e '$a="abc";for(1..6){$a=~m/(.|$)/g; dd \@-,\@+}'
          ([0, 0], [1, 1])
          ([1, 1], [2, 2])
          ([2, 2], [3, 3])
          ([3, 3], [3, 3])
          ([3], [3, 3])
          ([0, 0], [1, 1])
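On the original question of where that state lives: perlre ("Repeated Patterns Matching a Zero-length Substring") describes a "matched with zero-length" state that is associated with the matched string, and perlfunc's entry for pos says that setting pos resets it. I don't know of a documented way to read the flag directly, but the sketch below, which assumes that documented behavior, at least makes it observable by re-assigning pos to its own value between two attempts. The sub name two_attempts_at_end is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: make two m/.|$/g attempts at the end of the string,
# optionally re-assigning pos (to the same offset) between them.
# Returns (first_matched, second_matched) as 1/0 values.
sub two_attempts_at_end {
    my ($reassign_pos) = @_;
    my $str = "abc";
    pos($str) = length $str;                 # start at the end of the string
    my $first = $str =~ m/.|$/g ? 1 : 0;     # zero-length match via $
    pos($str) = pos($str) if $reassign_pos;  # same offset, but per perlfunc resets the flag
    my $second = $str =~ m/.|$/g ? 1 : 0;
    return ($first, $second);
}

printf "without reassignment: %d %d\n", two_attempts_at_end(0);
printf "with reassignment:    %d %d\n", two_attempts_at_end(1);
```

If the documentation holds, the second attempt fails only when pos was left alone, even though pos($str) is 3 in both cases.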