According to the Perl documentation, Perl will always match at the earliest possible position within the string. This is true of any typical alternation, as can be seen here: https://perldoc.perl.org/perlrequick.
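As a minimal sketch of that leftmost-match behavior (the sample string and variable names here are invented purely for illustration), listing the period first in an alternation does not help when a colon appears earlier in the string:

```perl
use strict;
use warnings;

# Hypothetical sample string, invented for this illustration.
my $text = "wait: stop. go";

my ($found, $offset);

# The period is listed first in the alternation, but the colon occurs
# earlier in the string, so the colon is what gets matched.
if ($text =~ /(\.|:)/) {
    ($found, $offset) = ($1, $-[0]);
    print "matched '$found' at offset $offset\n";   # matched ':' at offset 4
}
```

The order of the alternatives only decides between candidates at the same starting position; it never overrides an earlier starting position.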

However, this is not what I need. I need to specify a priority of match without regard to position. An approximation of what I need is illustrated in the following (perhaps poor) example:

$line = qq~I'm looking for the end of a sentence, where possible.  However, in some cases, I'll need to go with a non-conventional "end" to it, such as:

"Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style." (Famous, 1999)

Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.~;


$line =~ s/^
   (.*?)
   (
      (?:[.?!"])      #FIRST PRIORITY
       |
      (?:[:;-])      #SECOND PRIORITY
      |
      (?:\n|\r|\z|$)   #LAST PRIORITY
   )
   /<span class="s">$1$2</span>/gmx;

For the above, the desired sentence matches should be:

I'm looking for the end of a sentence, where possible.
However, in some cases, I'll need to go with a non-conventional "end" to it, such as:
"Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style."
(Famous, 1999)
Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.

As the example illustrates, the sentences should break at the first colon, but not at the second, as there is a higher-priority break-point, the period.

Is it possible to mandate a match priority such that the first alternative, irrespective of position, will be looked for first, and only upon failure would the next priority be sought, and so on? I have a case where my entire regex is failing on this issue, and I just cannot think of a good way to resolve it. It would not work, in my case, to use two separate regexes, as it would destroy the correct ordering of the sequences matched.
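One possible way to express this kind of priority, offered only as a sketch and not as a confirmed solution to my real case, is to anchor the match and move the lazy prefix inside each branch, so that the first branch is allowed to search the whole line before the second branch is tried at all. The sample lines and the names @lines and @chunks are hypothetical, and the terminator classes are simplified from the ones above:

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this sketch.
my @lines = (
    'Colon first: then a period. Trailing words',
    'Only a colon here: and nothing stronger',
    'No terminator at all',
);

my @chunks;
for my $line (@lines) {
    # Each branch carries its own lazy prefix. Because the match is anchored
    # with \A, the engine exhausts branch 1 over the whole line before it
    # ever tries branch 2, and branch 2 before branch 3.
    if ($line =~ /\A(?: (.*?[.?!])     # first priority: sentence-ending mark
                      | (.*?[:;])      # second priority: colon or semicolon
                      | (.*)           # last priority: rest of the line
                  )/x) {
        push @chunks, $1 // $2 // $3;
    }
}

print "[$_]\n" for @chunks;
# [Colon first: then a period.]
# [Only a colon here:]
# [No terminator at all]
```

The anchor is what makes the branch order matter: all three branches compete at the same starting position, so the leftmost-match rule no longer favors the colon. In a /g loop, \G would presumably play the role that \A plays here.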

Edit:

Perhaps this will be a better example/illustration.

Point 1.3.4: A piece of text.

Point 1.3.5: A piece of text.

Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.

Point 1.3.6: In fact, this piece of text even broke into a new line.

Point 1.3.7: Finally, a new piece of text.

Now, it's easy to see that there are four points here. But the computer might not "see" four as it reads each of the "Point" notations. How could these points be captured such that each substitution will operate on the FULL point at once, not just a portion of a point? In other words, Point 1.3.6 needs to include three such notations spanning two separate lines.

I have coded it something like this:

$line =~ s~^
(
   Point\s(\d+)\.(\d+)\.(\d+)
   (.*?)
)   
   (?=
      (?:Point\s
         (?:\d+)\.(?:\d+)\.(?!\4)
      )   #1 Priority
      |
      (?:\z|$)  #2 Priority
   )
  ~$processthis->()~egmx;

However, the #2 Priority match, because it matches first, ends up trumping the #1 Priority match, and any of the chunks of the form illustrated by Point 1.3.6 end up truncated.

Again, this is just an illustration, but perhaps it more clearly explains the priority issues. Moving the (.*?) into the forward-looking assertion(s), as some suggested I try, did not bring about the desired results for me.
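For reference, here is a rough sketch of how the grouping could be made to work on the sample text, under some assumptions that may not hold for my real data: every point number is followed by a colon, and the whole text is in one string. It differs from my code in three ways: it uses /s so that . can cross newlines, it drops /m and $ in favor of \z alone (so the end-of-string branch cannot fire at the end of every line), and it compares the whole point number instead of only the last component. The names $text and @points are hypothetical:

```perl
use strict;
use warnings;

my $text = <<'END';
Point 1.3.4: A piece of text.

Point 1.3.5: A piece of text.

Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.

Point 1.3.6: In fact, this piece of text even broke into a new line.

Point 1.3.7: Finally, a new piece of text.
END

my @points;
while ($text =~ /
        ( Point\s (\d+\.\d+\.\d+) : .*? )     # $1 = whole point, $2 = its number
        (?=
            Point\s (?!\2:) \d+\.\d+\.\d+ :   # a Point header with a different number
          |
            \z                                # or the end of the entire string
        )
    /gsx) {
    push @points, $1;
}

print scalar(@points), " points\n";   # 4 points
```

With this sample, the third captured chunk contains all three "Point 1.3.6:" notations, because the lazy .*? keeps extending until the lookahead sees a header whose number differs from $2.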

Blessings,

~Polyglot~

In reply to How to enforce match priority irrespective of string position by Polyglot
