in reply to Re^2: perl indication of end of string already matched
in thread perl indication of end of string already matched

print 'pos at end' if pos $str == length $str;

This if statement is true whether or not the previous regex matched $.

I don't understand this. Can you give an example of a non-lookahead regex that matches to the end of a string and does not match at the end of the string, i.e. does not leave pos sitting beyond the end of the string (or pos == length)?
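For what it's worth, here is a small sketch of how the quoted statement can be read. It assumes the pattern m/.|$/g from later in the thread, and the sub name end_flags is made up for illustration. It records, after every successful match, whether pos == length holds.

```perl
use strict;
use warnings;

# Hypothetical helper: after each successful m/.|$/g match, record
# whether pos() has reached the end of the string (1) or not (0).
sub end_flags {
    my ($str) = @_;
    my @flags;
    while ($str =~ m/.|$/g) {
        push @flags, (pos($str) == length($str)) ? 1 : 0;
    }
    return @flags;
}

print join(' ', end_flags("abc")), "\n";
```

The last two successful matches (the one consuming 'c' and the empty one at $) both leave pos == length, so this test alone cannot tell them apart.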

Using pos == length is sufficient. ... a simpler call ... avoid two calls ... and instead call one function ... I'm very sensitive to performance during parsing.

It sounds as if you may have an answer (even though I'm still a bit confused about the question). I imagine that Inline::C would allow you to define a single function to examine the internals of a string scalar and return info on pos versus length. Good luck :)


Give a man a fish:  <%-{-{-{-<

Re^4: perl indication of end of string already matched
by nachumk (Initiate) on Jun 08, 2020 at 21:58 UTC
    I guess in its simplest form, my question is whether the state that the regex engine uses to determine whether $ was already matched is available to me. For example:

    my $str = "abc";
    $str =~ m/.|$/g; # succeeds, and pos goes to 1
    $str =~ m/.|$/g; # succeeds, and pos goes to 2
    $str =~ m/.|$/g; # succeeds, and pos goes to 3
    $str =~ m/.|$/g; # succeeds, and pos stays at 3
    # Why does this behave differently than the next regex? pos($str) is 3 for both calls
    $str =~ m/.|$/g; # fails and resets pos

    How does the engine know that it has already succeeded with the fourth regex (when pos is already 3)? There must be some mechanism that records that internally in the regex engine. I'd like to know if that information is accessible.

    FYI, I rewrite my code constantly, and this issue came up during one rewrite. I'll have to go back and see if there's a case where it is helpful, but the question still piqued my interest. I would like to understand whether the internal structure that holds this information ($ matched) is accessible.

      How does the engine know that it has already succeeded with the fourth regex (when pos is already 3)? There must be some mechanism that records that internally in the regex engine.

      It might be informative to look at re debugging (although the particular state you're asking about isn't displayed, unfortunately):

      use re 'debug';
      my $str = "abc";
      $str =~ m/.|$/g for 1..5;
      __END__
      Compiling REx ".|$"
      Final program:
         1: BRANCH (3)
         2:   REG_ANY (5)
         3: BRANCH (FAIL)
         4:   SEOL (5)
         5: END (0)
      minlen 0
      Matching REx ".|$" against "abc"
         0 <> <abc>     |   0| 1:BRANCH(3)
         0 <> <abc>     |   1|  2:REG_ANY(5)
         1 <a> <bc>     |   1|  5:END(0)
      Match successful!
      Matching REx ".|$" against "bc"
         1 <a> <bc>     |   0| 1:BRANCH(3)
         1 <a> <bc>     |   1|  2:REG_ANY(5)
         2 <ab> <c>     |   1|  5:END(0)
      Match successful!
      Matching REx ".|$" against "c"
         2 <ab> <c>     |   0| 1:BRANCH(3)
         2 <ab> <c>     |   1|  2:REG_ANY(5)
         3 <abc> <>     |   1|  5:END(0)
      Match successful!
      Matching REx ".|$" against ""
         3 <abc> <>     |   0| 1:BRANCH(3)
         3 <abc> <>     |   1|  2:REG_ANY(5)
                        |   1|  failed...
         3 <abc> <>     |   0| 3:BRANCH(5)
         3 <abc> <>     |   1|  4:SEOL(5)
         3 <abc> <>     |   1|  5:END(0)
      Match successful!
      Matching REx ".|$" against ""
         3 <abc> <>     |   0| 1:BRANCH(3)
         3 <abc> <>     |   1|  2:REG_ANY(5)
                        |   1|  failed...
         3 <abc> <>     |   0| 3:BRANCH(5)
         3 <abc> <>     |   1|  4:SEOL(5)
         3 <abc> <>     |   1|  5:END(0)
      END: Match possible, but length=0 is smaller than requested=1, failing!
                        |   0| BRANCH failed...
      Match failed
      Freeing REx: ".|$"

      my question is whether the state that regex uses to determine whether $ was already matched is available to me? ... I would like to understand if the internal structure that has this information ($ matched) is accessible

      I have the same question as the others: Why? The regex engine seems to be DWIMing just fine...

      I don't know of a way to get at the internal information you're asking about. You can of course simply ask the engine to tell you what it matched, e.g.:

      my $str = "abc";
      print $str =~ m/(.)|$/g ? "matched! ".($1//'$') : 'no match', "\n" for 1..5;
      __END__
      matched! a
      matched! b
      matched! c
      matched! $
      no match
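A side note on where that state lives, offered as a sketch and not as a way to read it directly: perlre ("Repeated Patterns Matching a Zero-length Substring") says the "matched with zero length" state is associated with the matched string and is reset by each assignment to pos(). Assuming that documented behaviour, the helper below (its name is made up for illustration) runs the pattern twice at the end of the string, with and without re-assigning pos() in between.

```perl
use strict;
use warnings;

# Hypothetical helper: match m/.|$/g twice starting at the end of "abc".
# If $reset_pos is true, re-assign pos() between the two matches, which
# per perlre clears the "already matched with zero length" state.
sub match_twice_at_end {
    my ($reset_pos) = @_;
    my $str = "abc";
    pos($str) = length $str;                  # start scanning at the end
    my $first = ($str =~ m/.|$/g) ? 1 : 0;    # empty match of $ at pos 3
    pos($str) = pos($str) if $reset_pos;      # assignment resets the state
    my $second = ($str =~ m/.|$/g) ? 1 : 0;
    return ($first, $second);
}

printf "without reset: %d %d\n", match_twice_at_end(0);
printf "with reset:    %d %d\n", match_twice_at_end(1);
```

So the flag is not exposed as a readable value here, but it can be cleared from Perl code, which at least shows it is stored with the string's pos() and not inside the compiled regex.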

      BTW: Another way to determine the start/middle/end position of a matched substring in a string is by using the @- and @+ regex special variables; see perlvar.

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my $s = 'foo--bar--baz';
       ;;
       while ($s =~ m{ (\w+) }xmsg) {
           printf qq{matched '$1' %s of '$s' \n},
               $-[1] == 0 ? 'at start' : $+[1] == length $s ? 'at very end' : 'in middle';
           }
      "
      matched 'foo' at start of 'foo--bar--baz'
      matched 'bar' in middle of 'foo--bar--baz'
      matched 'baz' at very end of 'foo--bar--baz'

      Play with adding '-'s at the start/end of $s to convince yourself this works.


      Give a man a fish:  <%-{-{-{-<

        Good point, although in this case those variables also don't allow one to differentiate between the last two matches:

        $ perl -MData::Dump -e '$a="abc";for(1..6){ $a=~m/.|$/g; dd \@-,\@+ }'
        ([0], [1])
        ([1], [2])
        ([2], [3])
        ([3], [3])
        ([3], [3])
        ([0], [1])

        Update: Hmm, well they can if one adds capture groups, which is kind of the same workaround as I showed here.

        $ perl -MData::Dump -e '$a="abc";for(1..6){$a=~m/(.|$)/g; dd \@-,\@+}'
        ([0, 0], [1, 1])
        ([1, 1], [2, 2])
        ([2, 2], [3, 3])
        ([3, 3], [3, 3])
        ([3], [3, 3])
        ([0, 0], [1, 1])
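One more angle, offered as a sketch and not as what anyone above intended: @- and @+ describe the last successful match, so checking the return value of the match first separates the failed attempt from the empty match, and $-[0] == $+[0] then flags the empty match. The sub name classify_matches is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: run m/.|$/g $tries times against a copy of $str
# and describe each attempt, consulting @- and @+ only after success.
sub classify_matches {
    my ($str, $tries) = @_;
    my @log;
    for (1 .. $tries) {
        if ($str =~ m/.|$/g) {
            push @log, $-[0] == $+[0]
                ? sprintf("empty match at %d", $-[0])
                : sprintf("matched %d..%d", $-[0], $+[0]);
        }
        else {
            push @log, "no match";
        }
    }
    return @log;
}

print "$_\n" for classify_matches("abc", 5);
```

This needs no capture group, at the cost of one extra comparison per successful match.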