PerlMonks  


According to the Perl documentation, Perl will always match at the earliest possible position within the string. This is true of any typical alternation, as can be seen here: https://perldoc.perl.org/perlrequick.
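A minimal sketch of that documented behaviour, using a made-up sample string: the order of the alternatives only matters among matches that start at the same position, so listing the period first does not make it win over an earlier colon.

```perl
use strict;
use warnings;

# Hypothetical sample text: the colon comes before the period.
my $text = "wait: stop.";

# "." is listed first in the alternation, but the engine tries every
# alternative at the leftmost position before moving one character on.
my ($hit) = $text =~ /(\.|:)/;

print "matched '$hit' at offset $-[0]\n";
```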

However, this is not what I need. I need to specify a match priority without regard to position. The following (perhaps poor) example approximates what I need:

$line = qq~I'm looking for the end of a sentence, where possible. However, in some cases, I'll need to go with a non-conventional "end" to it, such as:
"Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style." (Famous, 1999)
Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.~;

$line =~ s/^
    (.*?)
    (
    (?:[.?!"])        #FIRST PRIORITY
    |
    (?:[:;-])         #SECOND PRIORITY
    |
    (?:\n|\r|\z|$)    #LAST PRIORITY
    )
    /<span class="s">$1$2</span>/gmx;

For the above, the desired sentence matches should be:

  1. I'm looking for the end of a sentence, where possible.
  2. However, in some cases, I'll need to go with a non-conventional "end" to it, such as:
  3. "Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style."
  4. (Famous, 1999)

  5. Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.

As the example illustrates, the sentences should break at the first colon, but not at the second, as there is a higher-priority break-point, the period.

Is it possible to mandate a match priority such that the first one, irrespective of position, will be looked for first, and only upon failure would the next priority be sought, and so on? I have a case where my entire regex is failing on this issue, and I just cannot think of a good way to resolve it. It would not work, in my case, to use two separate regexes, as it would destroy the correct ordering of the sequences matched.
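One possible way to get this kind of priority in a single pass, offered as a sketch rather than a definitive answer: give each alternative its own negated character class, so that every alternative starts at the same position and the first one that can complete within the line wins. The sample string and the simplified terminator classes (the quote and the hyphen are left out) are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: a colon followed by a full stop on the same line,
# a line that ends at a colon, and a line with no terminator at all.
my $text = "Intro: it goes on. A heading:\nno terminator here\n";

(my $marked = $text) =~ s{
    (\s*)                      # leading whitespace stays outside the span
    (   [^.?!\n]* [.?!]        # 1st priority: run to a sentence ender
      | [^:;\n]*  [:;]         # 2nd priority: run to a colon or semicolon
      | [^\n]+                 # last resort: the rest of the line
    )
}{$1<span class="s">$2</span>}gx;

print $marked;
```

Because the first alternative may run past a colon on its way to a period, the colon in "Intro: it goes on." does not end the match, while "A heading:" falls through to the second alternative only because no sentence ender follows on that line.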

Edit:

Perhaps this will be a better example/illustration.

 

Point 1.3.4: A piece of text.

Point 1.3.5: A piece of text.

Point 1.3.6: Another piece of text. Point 1.3.6: For some reason this piece of text isn't finished yet.

Point 1.3.6: In fact, this piece of text even broke into a new line.

Point 1.3.7: Finally, a new piece of text.

 

Now, it's easy to see that there are four points here. But the computer might not "see" four as it reads each of the "Point" notations. How could these points be captured such that each substitution will operate on the FULL point at once, not just a portion of a point? In other words, Point 1.3.6 needs to include three such notations spanning two separate lines.

I have coded it something like this:

$line =~ s~^
    (
    Point\s(\d+)\.(\d+)\.(\d+)
    (.*?)
    )
    (?=
    (?:Point\s (?:\d+)\.(?:\d+)\.(?!\4) )    #1 Priority
    |
    (?:\z|$)                                 #2 Priority
    )
    ~$processthis->()~egmx;

However, the #2 Priority match, because it matches first, ends up trumping the #1 Priority match, and any of the chunks of the form illustrated by Point 1.3.6 end up truncated.

Again, this is just an illustration, but perhaps it more clearly explains the priority issues. Moving the (.*?) into the forward-looking assertion(s), as some suggested I try, did not bring about the desired results for me.
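A hedged sketch of one way the Point grouping might be approached: drop the end-of-line alternative altogether, let the dot cross newlines with /s, and stop only before a Point whose full number differs from the captured one, or at the very end of the string. The sample text and the array name @points are assumptions, and the sketch relies on every point number being followed by a colon, as in the sample above.

```perl
use strict;
use warnings;

# Hypothetical input modelled on the sample points in the post.
my $text = <<'END';
Point 1.3.5: A piece of text.
Point 1.3.6: Another piece of text. Point 1.3.6: Not finished yet.
Point 1.3.6: It even broke into a new line.
Point 1.3.7: Finally, a new piece of text.
END

my @points;
while ($text =~ m{
        ( Point\s (\d+\.\d+\.\d+) :     # second group holds the point number
          .*? )                         # as little text as possible
        (?=   \s* Point\s (?!\2:) \d    # stop before a different point number
            | \s* \z )                  # or at the end of the whole string
    }gsx) {
    push @points, $1;
}

print scalar(@points), " points found\n";
```

With no end-of-line alternative left to match first, the lazy quantifier keeps extending across lines until a different point number appears, so all three "Point 1.3.6" notations end up in one chunk. The same pattern could in principle be dropped into an s///e substitution in place of the while loop.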

Blessings,

~Polyglot~


In reply to How to enforce match priority irrespective of string position by Polyglot
