Polyglot has asked for the wisdom of the Perl Monks concerning the following question:
According to the Perl documentation, Perl will always match at the earliest possible position within the string. This is true of any typical alternation, as can be seen here: https://perldoc.perl.org/perlrequick.
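To make that behavior concrete, here is a minimal sketch of what I understand "earliest possible position" to mean. The sample string and the sub name `first_delimiter` are invented for this illustration and are not part of my real code.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first delimiter matched and its offset.
sub first_delimiter {
    my ($text) = @_;
    # The period is listed first in the alternation, but listing order
    # only matters among alternatives tried at the SAME position.
    return unless $text =~ /(\.|:)/;
    return ($1, $-[1]);    # $-[1] is the start offset of capture group 1
}

my ($delim, $offset) = first_delimiter("alpha: beta. gamma");
printf "matched '%s' at offset %d\n", $delim, $offset;
# prints: matched ':' at offset 5
```

The colon wins even though the period comes first in the alternation, simply because the colon occurs earlier in the string.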
However, this is not what I need. I need to specify a priority of match without regard to position. An approximation of what I need is illustrated in the following (perhaps poor) example:
$line = qq~I'm looking for the end of a sentence, where possible. However, in some cases, I'll need to go with a non-conventional "end" to it, such as:

"Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style." (Famous, 1999)

Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.~;

$line =~ s/^
    (.*?)
    (
        (?:[.?!"])        #FIRST PRIORITY
      |
        (?:[:;-])         #SECOND PRIORITY
      |
        (?:\n|\r|\z|$)    #LAST PRIORITY
    )
    /<span class="s">$1$2<\/span>/gmx;
For the above, the desired sentence matches should be:
As the example illustrates, the sentences should break at the first colon, but not at the second, as there is a higher-priority break-point, the period.
Is it possible to mandate a match priority such that the first one, irrespective of position, will be looked for first, and only upon failure would the next priority be sought, and so on? I have a case where my entire regex is failing on this issue, and I just cannot think of a good way to resolve it. It would not work, in my case, to use two separate regexes, as it would destroy the correct ordering of the sequences matched.
Edit:
Perhaps this will be a better example/illustration.
Point 1.3.4: A piece of text.
Point 1.3.5: A piece of text.
Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.
Point 1.3.6: In fact, this piece of text even broke into a new line.
Point 1.3.7: Finally, a new piece of text.
Now, it's easy to see that there are four points here. But the computer might not "see" four as it reads each of the "Point" notations. How could these points be captured such that each substitution will operate on the FULL point at once, not just a portion of a point? In other words, Point 1.3.6 needs to include three such notations spanning two separate lines.
I have coded it something like this:
$line =~ s~^
    (
        Point\s(\d+)\.(\d+)\.(\d+)
        (.*?)
    )
    (?=
        (?:Point\s (?:\d+)\.(?:\d+)\.(?!\4) )   #1 Priority
      |
        (?:\z|$)                                #2 Priority
    )
    ~$processthis->()~egmx;
However, the #2 Priority match, because it matches first, ends up trumping the #1 Priority match, and any of the chunks of the form illustrated by Point 1.3.6 end up truncated.
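Here is a minimal reproduction of that truncation, as a sketch only. It assumes `(\d+)` for the third number in the point notation and swaps the substitution for a plain match that collects each chunk. The sub name `chunks` and the variable `$sample` are invented for this illustration.

```perl
use strict;
use warnings;

# Sample data copied from the "Point" illustration above.
my $sample = <<'END';
Point 1.3.4: A piece of text.
Point 1.3.5: A piece of text.
Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.
Point 1.3.6: In fact, this piece of text even broke into a new line.
Point 1.3.7: Finally, a new piece of text.
END

# Hypothetical helper: returns every chunk the pattern captures in group 1.
sub chunks {
    my ($text) = @_;
    my @chunks;
    while ($text =~ m{
        ^
        (
            Point\s(\d+)\.(\d+)\.(\d+)
            (.*?)
        )
        (?=
            (?:Point\s\d+\.\d+\.(?!\4))   # 1st priority: a different point number
          |
            (?:\z|$)                      # 2nd priority: end of line or string
        )
    }gmx) {
        push @chunks, $1;
    }
    return @chunks;
}

my @found = chunks($sample);
printf "%d chunks found\n", scalar @found;   # prints: 5 chunks found
print "[$_]\n" for @found;
```

Four chunks are wanted, but five come back: the end-of-line alternative is satisfied at the end of the first physical line of Point 1.3.6, so the continuation on the next line becomes a chunk of its own.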
Again, this is just an illustration, but perhaps it more clearly explains the priority issues. Moving the (.*?) into the forward-looking assertion(s), as some suggested I try, did not bring about the desired results for me.
Blessings,
~Polyglot~