Re^2: perl indication of end of string already matched

Re^3: perl indication of end of string already matched by AnomalousMonk (Archbishop) on Jun 08, 2020 at 19:43 UTC
> `print 'pos at end' if pos $str == length $str;` This if statement is true whether or not the previous regex matched $.

I don't understand this. Can you give an example of a non-lookahead regex that matches to the end of a string and does not match at the end of the string, i.e. does not leave pos sitting beyond the end of the string (or pos == length)?

> Using pos == length is sufficient. ... a simpler call ... avoid two calls ... and instead call one function ... I'm very sensitive to performance during parsing.

It sounds as if you may have an answer (even though I'm still a bit confused about the question). I imagine that Inline::C would allow you to define a single function to examine the internals of a string scalar and return info on `pos` versus `length`. Good luck :)

Give a man a fish: `<%-{-{-{-<`
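For reference, the `pos == length` test can be wrapped in one pure-Perl sub. This is only a sketch: `pos_at_end` is a hypothetical name, not something from the thread, and it still calls both `pos` and `length` internally, so it tidies the call site without addressing the performance concern the way an Inline::C routine might.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): returns 1 when pos() of the
# caller's string sits exactly at its end, 0 otherwise. It works on $_[0]
# directly because copying a scalar does not copy its pos().
sub pos_at_end {
    my $p = pos($_[0]);
    return (defined $p && $p == length($_[0])) ? 1 : 0;
}

my $full = 'abc';
$full =~ m/\w+/g;    # consumes the whole string
my $part = 'abc--';
$part =~ m/\w+/g;    # stops before the trailing dashes

printf "%s: %d\n", $_->[0], $_->[1]
    for [ full => pos_at_end($full) ], [ part => pos_at_end($part) ];
```

A string that has never been matched with `/g` has an undefined `pos`, so the helper reports 0 for it.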
Re^4: perl indication of end of string already matched by nachumk (Initiate) on Jun 08, 2020 at 21:58 UTC
I guess in its most simplistic form, my question is whether the state that the regex engine uses to determine whether $ was already matched is available to me. For example:

```perl
my $str = "abc";
$str =~ m/.|$/g; # succeeds, and pos goes to 1
$str =~ m/.|$/g; # succeeds, and pos goes to 2
$str =~ m/.|$/g; # succeeds, and pos goes to 3
$str =~ m/.|$/g; # succeeds, and pos stays at 3
# Why does this behave differently than the next regex? pos($str) is 3 for both calls
$str =~ m/.|$/g; # fails and resets pos
```

How does the engine know that it has already succeeded with the fourth regex (when pos is already 3)? There must be some mechanism that records that internally in the regex engine. I'd like to know if that information is accessible.

FYI, I rewrite my code constantly, and this issue came up during one rewrite. I have to go back and see if there's a case where it is helpful, but the question still piqued my interest. I would like to understand if the internal structure that has this information ($ matched) is accessible.
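The sequence described in this post can be reproduced with a short script. It is a sketch only, and `trace_matches` is a hypothetical helper name. It records the result and `pos` after each scalar-context match, which makes it visible that `pos` equals the string length after both the third and the fourth match.

```perl
use strict;
use warnings;

# Hypothetical helper: apply m/.|$/g to a string $times times in scalar
# context and record the outcome and pos() after every attempt.
sub trace_matches {
    my ($str, $times) = @_;
    my @trace;
    for (1 .. $times) {
        my $ok  = $str =~ m/.|$/g;
        my $pos = pos($str);
        push @trace,
            ($ok ? 'match' : 'fail') . ' pos=' . (defined $pos ? $pos : 'undef');
    }
    return @trace;
}

print "$_\n" for trace_matches('abc', 5);
```

The fifth attempt fails and leaves `pos` undefined because the match was done without the `/c` modifier.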
Re^5: perl indication of end of string already matched by haukex (Archbishop) on Jun 08, 2020 at 22:37 UTC
> How does the engine know that it has already succeeded with the fourth regex (when pos is already 3)? There must be some mechanism that records that internally in the regex engine.

It might be informative to look at re debugging (although the particular state you're asking about isn't displayed, unfortunately):

```perl
use re 'debug';
my $str = "abc";
$str =~ m/.|$/g for 1..5;
__END__
Compiling REx ".|$"
Final program:
   1: BRANCH (3)
   2:   REG_ANY (5)
   3: BRANCH (FAIL)
   4:   SEOL (5)
   5: END (0)
minlen 0
Matching REx ".|$" against "abc"
   0 <> <abc>   |   0| 1:BRANCH(3)
   0 <> <abc>   |   1|  2:REG_ANY(5)
   1 <a> <bc>   |   1|  5:END(0)
Match successful!
Matching REx ".|$" against "bc"
   1 <a> <bc>   |   0| 1:BRANCH(3)
   1 <a> <bc>   |   1|  2:REG_ANY(5)
   2 <ab> <c>   |   1|  5:END(0)
Match successful!
Matching REx ".|$" against "c"
   2 <ab> <c>   |   0| 1:BRANCH(3)
   2 <ab> <c>   |   1|  2:REG_ANY(5)
   3 <abc> <>   |   1|  5:END(0)
Match successful!
Matching REx ".|$" against ""
   3 <abc> <>   |   0| 1:BRANCH(3)
   3 <abc> <>   |   1|  2:REG_ANY(5)
                |   1|  failed...
   3 <abc> <>   |   0| 3:BRANCH(5)
   3 <abc> <>   |   1|  4:SEOL(5)
   3 <abc> <>   |   1|  5:END(0)
Match successful!
Matching REx ".|$" against ""
   3 <abc> <>   |   0| 1:BRANCH(3)
   3 <abc> <>   |   1|  2:REG_ANY(5)
                |   1|  failed...
   3 <abc> <>   |   0| 3:BRANCH(5)
   3 <abc> <>   |   1|  4:SEOL(5)
   3 <abc> <>   |   1|  5:END(0)
END: Match possible, but length=0 is smaller than requested=1, failing!
                |   0| BRANCH failed...
Match failed
Freeing REx: ".|$"
```

> my question is whether the state that regex uses to determine whether $ was already matched is available to me? ... I would like to understand if the internal structure that has this information ($ matched) is accessible

I have the same question as the others: Why? The regex engine seems to be DWIMing just fine... I don't know of a way to get at the internal information you're asking about. You can of course simply ask the engine to tell you what it matched, e.g.:

```perl
my $str = "abc";
print $str =~ m/(.)|$/g ? "matched! ".($1//'$') : 'no match', "\n" for 1..5;
__END__
matched! a
matched! b
matched! c
matched! $
no match
```
Re^5: perl indication of end of string already matched by AnomalousMonk (Archbishop) on Jun 09, 2020 at 00:40 UTC
BTW: Another way to determine the start/middle/end position of a matched substring in a string is by using the `@-` and `@+` regex special variables; see perlvar.

```
c:\@Work\Perl\monks>perl -wMstrict -le
"my $s = 'foo--bar--baz';
 ;;
 while ($s =~ m{ (\w+) }xmsg) {
     printf qq{matched '$1' %s of '$s' \n},
         $-[1] == 0 ? 'at start' : $+[1] == length $s ? 'at very end' : 'in middle';
     }
"
matched 'foo' at start of 'foo--bar--baz'
matched 'bar' in middle of 'foo--bar--baz'
matched 'baz' at very end of 'foo--bar--baz'
```

Play with adding `'-'`s at the start/end of `$s` to convince yourself this works.
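The same `@-`/`@+` check can be written as a script rather than a Windows one-liner, which makes the suggested experiment with leading and trailing dashes easy to run. This is a sketch; `classify_words` and its `word:position` output format are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: label each \w+ run in $s by where it sits, using
# $-[1] (start offset of group 1) and $+[1] (offset just past its end).
sub classify_words {
    my ($s) = @_;
    my @out;
    while ($s =~ m{ (\w+) }xmsg) {
        my $where = $-[1] == 0          ? 'start'
                  : $+[1] == length($s) ? 'end'
                  :                       'middle';
        push @out, "$1:$where";
    }
    return @out;
}

print join(' ', classify_words('foo--bar--baz')),   "\n";
print join(' ', classify_words('-foo--bar--baz-')), "\n";
```

With a dash added at each end, no word touches offset 0 or the final offset, so every word is reported as `middle`.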
Re^6: perl indication of end of string already matched (updated) by haukex (Archbishop) on Jun 09, 2020 at 08:29 UTC
Re^3: perl indication of end of string already matched by rsFalse (Chaplain) on Jun 16, 2020 at 11:31 UTC
>>How does the regex engine know that it matched $?

Hi. I think the engine doesn't know. It knows, for example, that it has already matched a zero-length branch once at this position, and it refuses to match a second time at the same place in order to avoid matching forever. I believe you can get similar results with regexes like these: `m/.|(?:)/gc`, `m/.|(?=)|(?:)/gc`...
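One way to check this claim is to count how often each pattern matches before the engine gives up. This is a sketch, and `count_matches` is a hypothetical helper. It uses `/gc`, so a failed match leaves `pos` where it was instead of resetting it.

```perl
use strict;
use warnings;

# Hypothetical helper: count successful /gc matches of $re against $str
# and report where pos() is left once the engine refuses to match again.
sub count_matches {
    my ($str, $re) = @_;
    my $n = 0;
    $n++ while $str =~ m/$re/gc;
    return ($n, pos($str));
}

for my $re (qr/.|$/, qr/.|(?:)/, qr/.|(?=)|(?:)/) {
    my ($n, $pos) = count_matches('abc', $re);
    print "$re: $n matches, pos left at $pos\n";
}
```

On `"abc"` each of the three patterns should match four times: once per character, plus one empty match at offset 3. The next attempt then fails, because a second empty match at the same position is not allowed.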