
print 'pos at end' if pos $str == length $str;

This if statement is true whether or not the previous regex matched $. And the regex engine will only match $ once. How does the regex engine know that it matched $? And can I get access to that information? I prefer to not call the regex engine again.

Using pos == length is sufficient. I was hoping there was a simpler call, something like pos() but for identifying whether $ was already matched. That would allow me to avoid two calls (length and pos) and instead call one function (perhaps eos($str)). I'm very sensitive to performance during parsing.

Re^3: perl indication of end of string already matched
by AnomalousMonk (Archbishop) on Jun 08, 2020 at 19:43 UTC
    print 'pos at end' if pos $str == length $str;

    This if statement is true whether or not the previous regex matched $.

    I don't understand this. Can you give an example of a non-lookahead regex that matches to the end of a string and does not match at the end of the string, i.e. does not leave pos sitting beyond the end of the string (or pos == length)?

    Using pos == length is sufficient. ... a simpler call ... avoid two calls ... and instead call one function ... I'm very sensitive to performance during parsing.

    It sounds as if you may have an answer (even though I'm still a bit confused about the question). I imagine that Inline::C would allow you to define a single function to examine the internals of a string scalar and return info on pos versus length. Good luck :)
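Short of Inline::C, the single-call helper asked for above can be sketched in pure Perl. The name at_eos is hypothetical (it is not a Perl builtin), and the sketch only wraps the same pos and length calls rather than removing them, so the sub call will likely cost more than the inline comparison. It is a readability convenience, not a speedup.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Perl builtin): true when the last m//g match
# on the given string left pos() exactly at the end of the string.
# pos() is undef before any match and after a failed m//g without /c,
# so that case is reported as false.
sub at_eos {
    my $p = pos($_[0]);            # $_[0] aliases the caller's variable
    return defined($p) && $p == length($_[0]);
}

my $str = "abc";
$str =~ m/\w+/g;                   # consumes the whole string
print "pos at end\n" if at_eos($str);
```

Because @_ aliases the caller's scalar, pos($_[0]) reads the position stored on the original string and not on a copy.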


    Give a man a fish:  <%-{-{-{-<

      I guess in its simplest form, my question is whether the state that the regex engine uses to determine whether $ was already matched is available to me. For example:

      my $str = "abc";
      $str =~ m/.|$/g; # succeeds, and pos goes to 1
      $str =~ m/.|$/g; # succeeds, and pos goes to 2
      $str =~ m/.|$/g; # succeeds, and pos goes to 3
      $str =~ m/.|$/g; # succeeds, and pos stays at 3
      # Why does this behave differently than the next regex? pos($str) is 3 for both calls
      $str =~ m/.|$/g; # fails and resets pos

      How does the engine know that it has already succeeded with the fourth regex (when pos is already 3)? There must be some mechanism that records that internally in the regex engine. I'd like to know if that information is accessible.

      FYI, I rewrite my code constantly, and this issue came up during one rewrite; I have to go back and see if there's a case where it is helpful. But the question still piqued my interest. I would like to understand if the internal structure that has this information ($ matched) is accessible.

        How does the engine know that it has already succeeded with the fourth regex (when pos is already 3)? There must be some mechanism that records that internally in the regex engine.

        It might be informative to look at re debugging (although the particular state you're asking about isn't displayed, unfortunately):

        use re 'debug';
        my $str = "abc";
        $str =~ m/.|$/g for 1..5;
        __END__
        Compiling REx ".|$"
        Final program:
           1: BRANCH (3)
           2:   REG_ANY (5)
           3: BRANCH (FAIL)
           4:   SEOL (5)
           5: END (0)
        minlen 0
        Matching REx ".|$" against "abc"
           0 <> <abc>    |   0| 1:BRANCH(3)
           0 <> <abc>    |   1|  2:REG_ANY(5)
           1 <a> <bc>    |   1|  5:END(0)
        Match successful!
        Matching REx ".|$" against "bc"
           1 <a> <bc>    |   0| 1:BRANCH(3)
           1 <a> <bc>    |   1|  2:REG_ANY(5)
           2 <ab> <c>    |   1|  5:END(0)
        Match successful!
        Matching REx ".|$" against "c"
           2 <ab> <c>    |   0| 1:BRANCH(3)
           2 <ab> <c>    |   1|  2:REG_ANY(5)
           3 <abc> <>    |   1|  5:END(0)
        Match successful!
        Matching REx ".|$" against ""
           3 <abc> <>    |   0| 1:BRANCH(3)
           3 <abc> <>    |   1|  2:REG_ANY(5)
                         |   1|  failed...
           3 <abc> <>    |   0| 3:BRANCH(5)
           3 <abc> <>    |   1|  4:SEOL(5)
           3 <abc> <>    |   1|  5:END(0)
        Match successful!
        Matching REx ".|$" against ""
           3 <abc> <>    |   0| 1:BRANCH(3)
           3 <abc> <>    |   1|  2:REG_ANY(5)
                         |   1|  failed...
           3 <abc> <>    |   0| 3:BRANCH(5)
           3 <abc> <>    |   1|  4:SEOL(5)
           3 <abc> <>    |   1|  5:END(0)
        END: Match possible, but length=0 is smaller than requested=1, failing!
                         |   0| BRANCH failed...
        Match failed
        Freeing REx: ".|$"

        my question is whether the state that regex uses to determine whether $ was already matched is available to me? ... I would like to understand if the internal structure that has this information ($ matched) is accessible

        I have the same question as the others: Why? The regex engine seems to be DWIMing just fine...

        I don't know of a way to get at the internal information you're asking about. You can of course simply ask the engine to tell you what it matched, e.g.:

        my $str = "abc";
        print $str =~ m/(.)|$/g ? "matched! ".($1//'$') : 'no match', "\n" for 1..5;
        __END__
        matched! a
        matched! b
        matched! c
        matched! $
        no match

        BTW: Another way to determine the start/middle/end position of a matched substring in a string is by using the  @- @+ regex special variables; see perlvar.

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = 'foo--bar--baz';
         ;;
         while ($s =~ m{ (\w+) }xmsg) {
           printf qq{matched '$1' %s of '$s' \n},
             $-[1] == 0 ? 'at start' : $+[1] == length $s ? 'at very end' : 'in middle';
           }
        "
        matched 'foo' at start of 'foo--bar--baz'
        matched 'bar' in middle of 'foo--bar--baz'
        matched 'baz' at very end of 'foo--bar--baz'

        Play with adding '-'s at the start/end of  $s to convince yourself this works.
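The same special variables can also tell you, right after a successful match, whether that match was zero-length, which is the condition the debug output above shows the engine objecting to on the next attempt ("length=0 is smaller than requested=1"). The following is only a sketch of reading that from the outside with $-[0] and $+[0]; it does not touch the engine's internal state, and the @log array and its message format are made up for illustration.

```perl
use strict;
use warnings;

my $str = "abc";
my @log;

for my $attempt (1 .. 5) {
    if ($str =~ m/.|$/g) {
        # $-[0] and $+[0] are the start and end offsets of the whole match;
        # equal offsets mean the match consumed nothing.
        my $kind = $-[0] == $+[0] ? 'zero-length' : 'consumed';
        push @log, "$attempt: $kind, pos=" . pos($str);
    }
    else {
        # A failed m//g (without /c) resets pos to undef.
        push @log, "$attempt: failed, pos=" . (pos($str) // 'undef');
    }
}

print "$_\n" for @log;
```

The fourth attempt is reported as zero-length with pos still at 3, and the fifth fails, matching the sequence described earlier in the thread.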


        Give a man a fish:  <%-{-{-{-<

Re^3: perl indication of end of string already matched
by rsFalse (Chaplain) on Jun 16, 2020 at 11:31 UTC
    >>How does the regex engine know that it matched $?

    Hi.

    I think the engine doesn't know. It only knows, e.g., that it has already matched a zero-length branch once at that position, and it refuses to match a second time at the same place in order to avoid matching forever.
    I believe you can get similar results with regexes like these: m/.|(?:)/gc, m/.|(?=)|(?:)/gc...
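As far as I can tell there is no way to read that remembered state directly, but perlre (section "Repeated Patterns Matching a Zero-length Substring") says it is associated with the matched string and is reset by each assignment to pos(). A small sketch of that, using the m/.|(?:)/gc pattern from above; the @results array is just for illustration.

```perl
use strict;
use warnings;

my $str = "abc";
pos($str) = length $str;                        # start at the end of the string

my @results;
push @results, ($str =~ m/.|(?:)/gc ? 1 : 0);   # zero-length match at pos 3
push @results, ($str =~ m/.|(?:)/gc ? 1 : 0);   # refused: zero-length at the same place again
pos($str) = pos($str);                          # assigning to pos clears the remembered state
push @results, ($str =~ m/.|(?:)/gc ? 1 : 0);   # zero-length match is allowed again

print "@results\n";
```

The /c flag matters here: without it the failed second match would reset pos to undef, and the self-assignment would restart matching from the beginning of the string.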