in reply to Problem with a text-parsing regex
Here's one approach to solving the first problem: handling both "it's" and "will-o'-the-wisp":
    (                           # $2: a "word" consisting of one or more of
      (?:
        [[:word:]]              # a word character
      |                         # or hyphen, quote, or both
                                # with word characters before and after
        (?<= [[:word:]] )
        (?: ' | - | '- | -' )
        (?= [[:word:]] )
      )+
    )
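To try that fragment in isolation, here is a minimal sketch. It assumes the /x flag (the layout and comments above depend on it), and it drops the outer capture group since nothing else is being captured here. The names $word and words_in are made up for the example and are not from the original script.

```perl
use strict;
use warnings;

# The "word" pattern, compiled with /x so whitespace and comments are ignored.
my $word = qr{
    (?:
        [[:word:]]                  # a word character
    |                               # or hyphen, quote, or both
        (?<= [[:word:]] )           # preceded by a word character
        (?: ' | - | '- | -' )
        (?=  [[:word:]] )           # and followed by one
    )+
}x;

# Hypothetical helper: return every "word" found in a string.
sub words_in {
    my ($text) = @_;
    return $text =~ /($word)/g;
}

print join('|', words_in("it's a will-o'-the-wisp")), "\n";   # it's|a|will-o'-the-wisp
```

Because the quote and hyphen alternatives need a word character on both sides, a trailing apostrophe or a doubled hyphen is left out of the word rather than absorbed into it.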
For the double-hyphen, the easy solution is to replace it with a space before parsing. The harder solution is to disallow it within the [[:punct:]]*, something like:
    # any punctuation excluding "-"
    # or "-" that is neither preceded nor followed by itself
    (?: (?!-) [[:punct:]] | (?<!-) - (?!-) )*
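Both options can be tried side by side. This is only a sketch: strip_double_hyphen and leading_punct are hypothetical helper names, and the sample strings are invented rather than taken from the original test cases.

```perl
use strict;
use warnings;

# Easy route: turn every "--" into a space before parsing.
# (strip_double_hyphen is a made-up helper name.)
sub strip_double_hyphen {
    my ($text) = @_;
    (my $copy = $text) =~ s/--/ /g;
    return $copy;
}

# Harder route: a punctuation run that will not consume any part of "--".
my $punct = qr{
    (?:
        (?!-) [[:punct:]]       # any punctuation excluding "-"
    |
        (?<!-) - (?!-)          # or a lone "-" with no "-" on either side
    )*
}x;

# Made-up helper: the punctuation run at the start of a string.
sub leading_punct {
    my ($text) = @_;
    return $text =~ /\A($punct)/ ? $1 : '';
}

print strip_double_hyphen("wait--what?"), "\n";   # wait what?
print leading_punct(q{("-word}), "\n";            # ("-
print leading_punct(q{!--word}), "\n";            # !
```

A single hyphen still counts as punctuation, but the run stops short of a "--", which leaves the double hyphen available to be treated as a separator.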
With those two changes, I _think_ it passes all your test cases.
With a sufficiently recent perl, the experimental regex_sets feature should let you construct "any punctuation except hyphen" directly as a character class, which would be more efficient than /(?!-) [[:punct:]]/. I haven't yet worked out how to do that though - it's made harder by the special nature of '-' in character classes, doubly-special in char class arithmetic.
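For what it's worth, one guess at how the set-arithmetic version might look is below. Treat it as an unverified sketch rather than a worked-out answer: it assumes the hyphen can be given as its own escaped bracketed class, [\-], and subtracted from [[:punct:]], and it has not been checked against the original test cases. is_punct_not_hyphen is a hypothetical helper name.

```perl
use strict;
use warnings;
# Silences the "experimental" warning on perls that still emit it for (?[ ... ]).
no warnings 'experimental::regex_sets';

# Hypothetical helper: true if the argument is a single punctuation
# character other than "-", using set subtraction inside (?[ ... ]).
sub is_punct_not_hyphen {
    my ($ch) = @_;
    return $ch =~ /\A(?[ [[:punct:]] - [\-] ])\z/ ? 1 : 0;
}

for my $ch (',', '!', '-', 'a') {
    printf "%s => %d\n", $ch, is_punct_not_hyphen($ch);
}
```

If that holds up, the class could stand in for /(?!-) [[:punct:]]/ in the first alternative of the punctuation pattern; the lone-hyphen alternative with its lookarounds would still be needed.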