Re^3: How to enforce match priority irrespective of string position

Re^4: How to enforce match priority irrespective of string position by Polyglot (Chaplain) on Mar 07, 2021 at 13:22 UTC
I sure was hoping someone would be able to suggest a regexp secret that I had not yet learned. I was hoping there would be some way of doing this. I may have to just pre-parse looking for the false positives, and exchange them temporarily for a marker of some sort before parsing a second time. I'm not even sure if that would work. I'll have to ponder that some more. I need to be able to reorder the sentences following a specific ruleset and in a specific order, by order of appearance in the sentence. Sigh. Too bad regex can't do everything! Blessings, ~Polyglot~
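The pre-parse idea above (swap the false positives for a marker, parse, then swap them back) might look roughly like this. It is only a sketch: the abbreviations treated as false positives, the NUL-delimited placeholder format, and the helper names `mask_false_positives` and `unmask` are all assumptions for illustration, not part of the actual script.

```perl
use strict;
use warnings;

# Assumption: "e.g." and "i.e." stand in for the real false positives,
# and the input never contains a NUL byte, so NUL is safe as a marker.
my @protected;

sub mask_false_positives {
    my ($text) = @_;
    # Replace each false positive with a numbered placeholder and remember it.
    $text =~ s/\b(e\.g\.|i\.e\.)/push @protected, $1; sprintf("\x00%d\x00", $#protected)/ge;
    return $text;
}

sub unmask {
    my ($text) = @_;
    # Put the remembered originals back in place of the placeholders.
    $text =~ s/\x00(\d+)\x00/$protected[$1]/g;
    return $text;
}

my $input  = "Use a marker, e.g. a NUL byte. Then split.";
my $masked = mask_false_positives($input);

# Second pass: the naive rule "a period followed by whitespace ends a sentence"
# can no longer trip over the masked abbreviation.
my @sentences = map { unmask($_) } split /(?<=\.)\s+/, $masked;
print "$_\n" for @sentences;
```

The marker only has to be something the real pattern can never match and the real data can never contain; any such string would do in place of the NUL bytes.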
Re^5: How to enforce match priority irrespective of string position by haukex (Archbishop) on Mar 07, 2021 at 14:12 UTC
"I was hoping there would be some way of doing this. ... Sigh. Too bad regex can't do everything!"

Be aware of the "if all you have is a hammer, everything looks like a nail" effect. Doing everything in a single regex is nice, but shouldn't be a requirement - sometimes, things can be expressed much more cleanly with a few regexes and some code.

And be aware of premature optimization as well - sure, oftentimes a single regex is faster than multiple, but usually it's better to get things working first instead of trying to bend over backwards and trying to wrap your head around a complex regex.

Especially in the case you describe, IMHO the brainpower is much better spent on writing up test cases first!

    use warnings;
    use strict;
    use Test::More;

    sub my_sentence_splitter {
        my $input = shift;
        my @output;
        # ... magic ...
        return \@output;
    }

    is_deeply my_sentence_splitter(<<END),
    I'm looking for the end of a sentence, where possible. However, in some cases, I'll need to go with a non-conventional "end" to it, such as: "Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style." (Famous, 1999) Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.
    END
    [
        q#I'm looking for the end of a sentence, where possible.#,
        q#However, in some cases, I'll need to go with a non-conventional "end" to it, such as:#,
        q#"Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style."#,
        q#(Famous, 1999)#,
        q#Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.#,
    ];

    # TODO: Many more test cases here!

    done_testing;
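To illustrate "a few regexes and some code" on a much smaller, made-up input, here is one possible two-pass sketch. The function name `split_sentences`, the two rules, and the sample sentence are assumptions for illustration only; this is not an implementation of the "magic" above and would not pass the full test case as written.

```perl
use strict;
use warnings;
use Test::More;

sub split_sentences {
    my ($text) = @_;
    my @out;
    # Pass 1: break after sentence-ending punctuation, optionally
    # followed by a closing double quote.
    for my $chunk (split /(?<=[.!?])\s+|(?<=[.!?]")\s+/, $text) {
        # Pass 2: a colon only ends a chunk when a quotation follows it.
        push @out, split /(?<=:)\s+(?=")/, $chunk;
    }
    return \@out;
}

is_deeply split_sentences(q{He said: "Stop here." Then he left. Note: this colon stays.}),
    [
        q{He said:},
        q{"Stop here."},
        q{Then he left.},
        q{Note: this colon stays.},
    ],
    'colon splits only before a quotation';

done_testing;
```

Each pass stays simple enough to read at a glance, and each new rule can be added as another pass together with another test case.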
Re^6: How to enforce match priority irrespective of string position by Polyglot (Chaplain) on Mar 07, 2021 at 14:31 UTC
I do appreciate all of your suggestions. In my case, speed is no issue. It's a one-off script that, once the job is done, will not be needed again. If it took all night or even three days to process, I wouldn't mind...so long as it was correctly completed (it should finish in just a few minutes, though). Furthermore, it's not running on English...which is one reason I gave a hypothetical example here. It's running on an Asian language, full of HTML-entity-style character codes which I'm converting to UTF-8, among other things. Yes, I'm a polyglot. :) Blessings, ~Polyglot~
Re^5: How to enforce match priority irrespective of string position by Anonymous Monk on Mar 07, 2021 at 13:34 UTC
Could you perhaps match repeatedly within the same string, in a loop, and then manually select what you consider to be the most appropriate match?
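A rough sketch of that loop, assuming a hypothetical list of rules ordered from highest to lowest priority. The rule names, the patterns, and the `best_match` helper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical rules, listed from highest to lowest priority.
my @rules = (
    [ full_stop => qr/[.!?]/ ],
    [ colon     => qr/:/     ],
);

sub best_match {
    my ($text) = @_;
    my @candidates;
    for my $rank (0 .. $#rules) {
        my ($name, $re) = @{ $rules[$rank] };
        # Scan the whole string for this rule and record where each match ends.
        while ($text =~ /$re/g) {
            push @candidates, { rule => $name, rank => $rank, end => pos($text) };
        }
    }
    # Select by priority first; among equal priorities, take the earliest match.
    my ($best) = sort { $a->{rank} <=> $b->{rank} or $a->{end} <=> $b->{end} } @candidates;
    return $best;    # undef when nothing matched
}

my $hit = best_match("such as: a quote follows. More text.");
print "$hit->{rule} ends at offset $hit->{end}\n" if $hit;
```

Here the colon appears first in the string, but the full stop wins because its rule ranks higher; the colon is chosen only when no higher-priority rule matches anywhere.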
Re^6: How to enforce match priority irrespective of string position by Polyglot (Chaplain) on Mar 07, 2021 at 14:11 UTC
Because of the complexity of the operation, I am actually matching a chunk at a time (it is this step where I've run into the "false positives" problem and need a priority-match solution) and then I am substituting via an evaluated subroutine which processes the captured chunk and returns the correct replacement. So, in a sense, I am doing this already--but in stages, via the subroutine. For the basic idea: `$str =~ s~[my regex] ~print "DEBUG: 3:$3; 4:$4; 5:$5; 6:$6\n"; $procfootnote->()~egmx;` I'm no stranger to regex...but regex is sufficiently complex that I doubt I'll ever fully master it! Blessings, ~Polyglot~
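For readers unfamiliar with the pattern, a self-contained sketch of a substitution that hands each captured chunk to a subroutine might look like this. The footnote syntax, the regex, and the body of `$procfootnote` are invented for illustration; the real regex and subroutine are not shown in this thread.

```perl
use strict;
use warnings;

# Hypothetical callback: takes the captured pieces, returns the replacement text.
my $procfootnote = sub {
    my ($number, $body) = @_;
    return "<sup>$number</sup> ($body)";
};

my $str = "Text [fn1: first note] and more [fn2: second note].";

# /e evaluates the replacement as Perl code, /g repeats over the whole string,
# and /x allows whitespace inside the pattern for readability.
$str =~ s~ \[ fn (\d+) : \s* ([^\]]*) \] ~ $procfootnote->($1, $2) ~egx;

print "$str\n";
```

Passing `$1` and `$2` as arguments, rather than reading the capture variables inside the subroutine, is a design choice in this sketch: it keeps the callback independent of how the groups in the regex are numbered.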

