How to enforce match priority irrespective of string position

Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

According to the Perl documentation, Perl will always match at the earliest possible position within the string. This is true of any typical alternation, as can be seen here: https://perldoc.perl.org/perlrequick.
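A minimal sketch of the leftmost-match rule described above, using a made-up sample string (the variable names are illustrative only): even though the period is listed first in the alternation, the colon is what matches, because it occurs earlier in the string.

```perl
use strict;
use warnings;

my $s = 'first: second.';

# "\." is the first alternative, yet ":" wins because the regex engine
# tries every alternative at offset 0, then offset 1, and so on, and
# the colon is reached before the period.
my ($hit) = $s =~ /(\.|:)/;
my $offset = $-[1];    # start offset of capture group 1

print "matched '$hit' at offset $offset\n";
```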

However, this is not what I need. I need to specify a priority of match without regard to position. An approximation of what I am needing is illustrated in the following (perhaps poor) example:

$line = qq~I'm looking for the end of a sentence, where possible.  However, in some cases, I'll need to go with a non-conventional "end" to it, such as:

"Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style." (Famous, 1999)

Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.~;


$line =~ s/^
   (.*?)
   (
      (?:[.?!"])      #FIRST PRIORITY
       |
      (?:[:;-])      #SECOND PRIORITY
      |
      (?:\n|\r|\z|$)   #LAST PRIORITY
   )
   /<span class="s">$1$2</span>/gmx;

For the above, the desired sentence matches should be:

I'm looking for the end of a sentence, where possible.
However, in some cases, I'll need to go with a non-conventional "end" to it, such as:
"Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style."
(Famous, 1999)
Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.

As the example illustrates, the sentences should break at the first colon, but not at the second, as there is a higher-priority break-point, the period.

Is it possible to mandate a match priority such that the first one, irrespective of position, will be looked for first, and only upon failure would the next priority be sought, and so on? I have a case where my entire regex is failing on this issue, and I just cannot think of a good way to resolve it. It would not work, in my case, to use two separate regexes, as it would destroy the correct ordering of the sequences matched.

Edit:

Perhaps this will be a better example/illustration.

Point 1.3.4: A piece of text.

Point 1.3.5: A piece of text.

Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.

Point 1.3.6: In fact, this piece of text even broke into a new line.

Point 1.3.7: Finally, a new piece of text.

Now, it's easy to see that there are four points here. But the computer might not "see" four as it reads each of the "Point" notations. How could these points be captured such that each substitution will operate on the FULL point at once, not just a portion of a point? In other words, Point 1.3.6 needs to include three such notations spanning two separate lines.

I have coded it something like this:

$line =~ s~^
(
   Point\s(\d+)\.(\d+)\.(\d+)
   (.*?)
)   
   (?=
      (?:Point\s
         (?:\d+)\.(?:\d+)\.(?!\4)
      )   #1 Priority
      |
      (?:\z|$)  #2 Priority
   )
  ~$processthis->()~egmx;

However, the #2 Priority match, because it matches first, ends up trumping the #1 Priority match, and any of the chunks of the form illustrated by Point 1.3.6 end up truncated.

Again, this is just an illustration, but perhaps it more clearly explains the priority issues. Moving the (.*?) into the forward-looking assertion(s), as some suggested I try, did not bring about the desired results for me.

Blessings,

~Polyglot~


Replies are listed 'Best First'.
Re: How to enforce match priority irrespective of string position by tybalt89 (Monsignor) on Mar 08, 2021 at 00:49 UTC
Finally, a "sort of" test case :)

#!/usr/bin/perl

use strict; # https://perlmonks.org/?node_id=11129253
use warnings;

local $_ = <<END;
Point 1.3.4: A piece of text.

Point 1.3.5: A piece of text.

Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.

Point 1.3.6: In fact, this piece of text even broke into a new line.

Point 1.3.7: Finally, a new piece of text.

END

my @parts;
push @parts, $& while /
  (Point\s[\d.]+:)
  .*?
  (?=Point|\z)
  (?!\1)
  /gsx;

use Data::Dump 'dd'; dd \@parts;

Outputs four chunks, just like you asked for:

[
  "Point 1.3.4: A piece of text.\n\n",
  "Point 1.3.5: A piece of text.\n\n",
  "Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.\n\nPoint 1.3.6: In fact, this piece of text even broke into a new line.\n\n",
  "Point 1.3.7: Finally, a new piece of text.\n\n",
]
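Since the original question was phrased as a substitution, here is a hedged sketch of how the same pair of lookaheads might be used with s///e. The sample text is shortened, and process() is a hypothetical stand-in for whatever per-point work is actually needed; neither comes from the thread.

```perl
use strict;
use warnings;

my $text = "Point 1.3.5: A piece of text.\n"
         . "Point 1.3.6: Another piece. Point 1.3.6: Not finished yet.\n"
         . "Point 1.3.6: Continued on a new line.\n"
         . "Point 1.3.7: Finally, a new piece of text.\n";

# Hypothetical per-point handler: trim trailing whitespace and wrap the chunk.
sub process {
    my ($chunk) = @_;
    $chunk =~ s/\s+\z//;
    return "<p>$chunk</p>\n";
}

(my $out = $text) =~ s{
    ( (Point\s[\d.]+:)   # group 2: the label of the current point
      .*? )              # group 1: the whole chunk, kept as short as possible
    (?=Point|\z)         # stop before the next Point or at end of string
    (?!\2)               # ...unless that next label repeats the current one
}{ process($1) }gsex;

print $out;
```

With this input the three 1.3.6 notations end up inside a single wrapped chunk, so the replacement code sees each full point exactly once.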
Re^2: How to enforce match priority irrespective of string position by Polyglot (Chaplain) on Mar 08, 2021 at 01:33 UTC
And that method worked! (Though I've had to restructure a bit to accommodate, as that was not in a simple substitution form.) I don't mind doing whatever is necessary to get things working, though...so thank you very much! I'll certainly upvote this when I get my next day's rations.

This part seems to be the crucial bit: (?=Point|\z) (?!\1). I find this sort of syntax confusing because it always seems to me that the "Point" here should have precedence over anything coming afterward in the regex sequence, in this case the "\1" backreference. If "Point" is already detected from the forward assertion, why can it be matched again (overlapped) by this reference, even if in the negative?

Well, no complaints at the moment, certainly, as at least the script is now past this hurdle. Thank you.

Blessings,

~Polyglot~
Re^3: How to enforce match priority irrespective of string position by tybalt89 (Monsignor) on Mar 08, 2021 at 02:25 UTC
Because (?= and (?! are ZERO-WIDTH assertions.
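A small sketch of what "zero-width" means in practice, using a toy string rather than the data from the thread: a lookahead inspects the text ahead but consumes none of it, so a second assertion (or ordinary pattern text) starts from the very same position.

```perl
use strict;
use warnings;

my $s = 'foobar';

# The lookahead requires "bar" to follow, but the match itself
# still ends right after "foo".
$s =~ /foo(?=bar)/ or die "no match";
my ($matched, $end) = ($&, $+[0]);
print "matched '$matched', match ended at offset $end\n";

# Two assertions in a row both look at offset 3; afterwards the
# capture group can still consume "bar" from that same offset.
my $stacked = $s =~ /foo(?=bar)(?!baz)(bar)/ ? $1 : undef;
print "after both assertions the group still captured '$stacked'\n";
```

This is why (?=Point|\z) followed by (?!\1) is not a conflict: both are tests applied at the same spot, and neither moves the match position.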
Re^4: How to enforce match priority irrespective of string position by Polyglot (Chaplain) on Mar 08, 2021 at 02:37 UTC
Re^5: How to enforce match priority irrespective of string position by eyepopslikeamosquito (Archbishop) on Mar 08, 2021 at 03:30 UTC
Re: How to enforce match priority irrespective of string position by Takeshi Kovacs (Beadle) on Mar 07, 2021 at 12:10 UTC
I have trouble fully grasping your intention, especially because your example text and your description overlap. Could it be that you are looking for recursive parsing, where anything in "quotes" won't be broken up at a period? The perldocs have examples of implementing this.
Re^2: How to enforce match priority irrespective of string position by Polyglot (Chaplain) on Mar 07, 2021 at 12:24 UTC
I am, of course, dealing with some exceptions in a body of text. The text has some irregularities, but could be parsed correctly if only I am able to impose a strict ordering of match priority. It isn't an issue of quotes, nor is nesting involved; it's actually an issue of some potential "false positives" that must be initially skipped in favor of a more favorable match unless that more favorable match cannot be found--in which case the "false positive" might be the correct match. Does this make sense?

Blessings,

~Polyglot~
Re^3: How to enforce match priority irrespective of string position by haukex (Archbishop) on Mar 07, 2021 at 12:59 UTC
I suspect it's likely you won't be able to do this with a simple regex. It sounds to me like you might want to start looking at parsers, such as the classic Parse::RecDescent, the regex-based Regexp::Grammars, or the relatively new Marpa::R2. Or as a middle ground, have a look at how Text::Sentence, Lingua::Sentence, and Lingua::EN::Sentence work internally.
Re^4: How to enforce match priority irrespective of string position by Polyglot (Chaplain) on Mar 07, 2021 at 13:22 UTC
Re^5: How to enforce match priority irrespective of string position by haukex (Archbishop) on Mar 07, 2021 at 14:12 UTC
Re^5: How to enforce match priority irrespective of string position by Anonymous Monk on Mar 07, 2021 at 13:34 UTC
Re^3: How to enforce match priority irrespective of string position by Takeshi Kovacs (Beadle) on Mar 07, 2021 at 12:36 UTC
I'd say use Hippo's template of an SSCCE (Re: Matching a string in a parenthesized block (regex help)) to write some tests for what you want and what you don't want. This would certainly be beneficial for you too.

Other than that, |-or conditions with swallowing can prioritize areas, like "quoted" ones.

demo

DB<132> $_ = 'phrase. "phrase1.phrase2" phrase. phrase'
0  'phrase. "phrase1.phrase2" phrase. phrase'
DB<133> split /(".*?"|\.)/
0  'phrase'
1  '.'
2  ' '
3  '"phrase1.phrase2"'
4  ' phrase'
5  '.'
6  ' phrase'
DB<134>
Re: How to enforce match priority irrespective of string position by jcb (Parson) on Mar 09, 2021 at 00:17 UTC
If you can advance incrementally through the text, you could try anchoring all of your patterns at pos with `\G`. Since `pos` is an lvalue, you could store the previous match position, try each pattern in priority order starting at the same previous position, take whichever match you prefer, store that into pos, and repeat for the next chunk. Something like (untested):

my $lastpos = 0;
while ($lastpos < length $_) {
  my @matches = (undef) x 3;
  pos = $lastpos; $matches[0] = pos if m/\G([^.?!"]+[.?!"])/gc;     #FIRST PRIORITY
  pos = $lastpos; $matches[1] = pos if m/\G([^:;-]+[:;-])/gc;       #SECOND PRIORITY
  pos = $lastpos; $matches[2] = pos if m/\G(.*(?:\n|\r|\z|$))/gc;   #LAST PRIORITY
  # somehow choose which match to use for the next cycle and set $lastpos here
  # substr $_, $lastpos, ($matches[$chosen] - $lastpos)
  #  should yield the selected chunk between choosing a match and updating $lastpos
}
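One way the "choose a match" step might be filled in, as a sketch only: take the first pattern, in priority order, that matches from the stored position. The sample text, the @priorities list, and the simplified last-priority pattern are assumptions made for illustration, not part of jcb's post.

```perl
use strict;
use warnings;

# Made-up sample text: one sentence with a colon and a period,
# then a tail that has only a colon.
my $text = 'Note this: it ends with a period. No terminator here: just a colon';

# Patterns in priority order, each anchored at the current pos with \G.
my @priorities = (
    qr/\G[^.?!"]*[.?!"]/,     # first priority: sentence-ending punctuation
    qr/\G[^:;-]*[:;-]/,       # second priority: colon, semicolon, or dash
    qr/\G[^\n]*(?:\n|\z)/,    # last priority: end of line or end of string
);

my @chunks;
my $lastpos = 0;
while ($lastpos < length $text) {
    my $end;
    for my $re (@priorities) {
        pos($text) = $lastpos;        # every pattern starts from the same spot
        if ($text =~ /$re/gc) {
            $end = pos $text;         # first pattern that matches wins
            last;
        }
    }
    last unless defined $end;         # nothing matched: stop rather than loop forever
    push @chunks, substr $text, $lastpos, $end - $lastpos;
    $lastpos = $end;
}

print "[$_]\n" for @chunks;
```

Here the first chunk runs past its colon to the period, while the second chunk falls back to the colon because no higher-priority terminator exists ahead of it.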
Re: How to enforce match priority irrespective of string position by rsFalse (Chaplain) on Mar 07, 2021 at 13:33 UTC
>> I need to specify a priority of match without regard to position.

Try look-ahead search. May this sketch give some help:

$line =~ s/^
   (
      (?= .*? $regex_1 ) .*? $regex_1      #FIRST PRIORITY
       |
      (?= .*? $regex_2 ) .*? $regex_2      #SECOND PRIORITY
   )
   /something/gmx;

Edit: removed text that caret is obsolete.

Upd.: I think my example (now struck through) simply reduces to the same but without look-ahead; see comment by LanX.
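A sketch of the reduced form mentioned in the update, assuming single-line input and the terminator classes from the original question: because alternatives are tried in order at the same starting position, an earlier alternative that can succeed anywhere further along the line takes precedence over a later alternative that would succeed sooner. The sample lines are invented for illustration.

```perl
use strict;
use warnings;

my @lines = (
    'Stop at the period: not at the colon. Rest',
    'No full stop here: so the colon wins',
    'Nothing special at all',
);

my @chunks;
for my $line (@lines) {
    my ($chunk) = $line =~ /^( [^\n]*? [.?!"]    # first priority
                             | [^\n]*? [:;-]     # second priority
                             | [^\n]*            # last priority: rest of line
                             )/x;
    push @chunks, $chunk;
    print "[$chunk]\n";
}
```

The first line is cut at its period even though a colon comes earlier; the second line has no period, so the colon alternative is used; the third falls through to the whole line.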
Re^2: How to enforce match priority irrespective of string position by LanX (Saint) on Mar 07, 2021 at 13:43 UTC
> `(?= .*? $regex_1 ) .*? $regex_1`

Does it make sense to repeat the regex? Isn't it rather `(?= $re_cond_1 ) $re_match_1`?

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
Re^3: How to enforce match priority irrespective of string position by rsFalse (Chaplain) on Mar 07, 2021 at 13:58 UTC
Of course, `'$re_cond_1'` may not be equal to `'$re_match_1'`. But I wanted to show the simplest example. Further, `'\1'` can be used to avoid repeating the pattern.
Re^4: How to enforce match priority irrespective of string position by LanX (Saint) on Mar 07, 2021 at 14:05 UTC
Re^5: How to enforce match priority irrespective of string position by Polyglot (Chaplain) on Mar 07, 2021 at 14:23 UTC
Re^5: How to enforce match priority irrespective of string position by rsFalse (Chaplain) on Mar 07, 2021 at 14:42 UTC
Re^3: How to enforce match priority irrespective of string position by Polyglot (Chaplain) on Mar 07, 2021 at 14:02 UTC
I must say that this syntax confuses me. I am already using a lookahead to define the forward edge of the match (versus where the next match will start in the global substitution), and everything up to but not including that lookahead needs to be captured. I've never thought one could capture from a lookahead...but perhaps I'd misunderstood. I'm also using backslash lookaround assertions, because some of what is matched will be matched again (these are the false positives) and for an unpredictable number of times (fewer than 20).

I tried putting rsFalse's suggestion to use but was unable to get the match to succeed. I don't think I understand it well enough.

Blessings,

~Polyglot~
Re^4: How to enforce match priority irrespective of string position by LanX (Saint) on Mar 07, 2021 at 14:43 UTC
Re^5: How to enforce match priority irrespective of string position by rsFalse (Chaplain) on Mar 07, 2021 at 15:01 UTC