Here's one approach to the first problem, handling both "it's" and "will-o'-the-wisp":
    (                           # $2: a "word" consisting of one or more of
      (?:
        [[:word:]]              # a word character
      |                         # or hyphen, quote, or both
                                # with word characters before and after
        (?<= [[:word:]] )
        (?: ' | - | '- | -' )
        (?= [[:word:]] )
      )+
    )
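To try that fragment in isolation, here is a minimal sketch. The names `$word` and `words_in`, and the sample strings beyond the two from the question, are my own choices for illustration and not part of the original regex.

```perl
use strict;
use warnings;

# The "word" group from above, compiled on its own. The /x flag lets
# the pattern keep its layout and comments.
my $word = qr{
    (?:
        [[:word:]]                  # a word character
    |                               # or hyphen, quote, or both,
        (?<= [[:word:]] )           # with a word character before
        (?: ' | - | '- | -' )
        (?=  [[:word:]] )           # and a word character after
    )+
}x;

# Return every "word" found in a string (hypothetical helper).
sub words_in {
    my ($text) = @_;
    return $text =~ /($word)/g;
}

for my $text ("it's", "will-o'-the-wisp", "'quoted'", "foo--bar") {
    printf "%-18s => %s\n", $text, join(' | ', words_in($text));
}
```

"it's" and "will-o'-the-wisp" should each come back as a single word, the outer quotes of "'quoted'" are left out, and "foo--bar" splits into "foo" and "bar" because neither hyphen has a word character on both sides.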
For the double-hyphen, the easy solution is to replace it with a space before parsing. The harder solution is to disallow it within the [[:punct:]]*, something like:
    # any punctuation excluding "-"
    # or "-" that is neither preceded nor followed by itself
    (?:
      (?!-) [[:punct:]]
    | (?<!-) - (?!-)
    )*
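A small sketch to see how that punctuation fragment behaves at the start of a string. The names `$punct` and `leading_punct` and the sample inputs are assumptions made up for the demonstration.

```perl
use strict;
use warnings;

# Punctuation run that refuses to swallow any part of a "--".
my $punct = qr{
    (?:
        (?!-) [[:punct:]]       # any punctuation excluding "-"
    |
        (?<!-) - (?!-)          # or a "-" with no "-" on either side
    )*
}x;

# Return the punctuation run at the start of a string; it may be
# empty (hypothetical helper).
sub leading_punct {
    my ($text) = @_;
    return $text =~ /\A($punct)/ ? $1 : undef;
}

for my $text ('("hello', '-dash', '--dash', '!--dash') {
    printf "%-10s => [%s]\n", $text, leading_punct($text);
}
```

A lone leading "-" is still treated as punctuation, while a leading "--" is left untouched for whatever comes next in the larger pattern.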
With those two changes, I _think_ it passes all your test cases.
With a sufficiently recent perl, the experimental regex_sets feature should let you construct "any punctuation except hyphen" directly as a character class, which would be more efficient than /(?!-) [[:punct:]]/. I haven't yet worked out how to do that though - it's made harder by the special nature of '-' in character classes, doubly-special in char class arithmetic.
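For what it's worth, my reading of the extended bracketed character class syntax in perlrecharclass is that subtracting an escaped hyphen from the POSIX class might look like the sketch below. Treat it as an untested assumption about how the set arithmetic applies here, not a verified answer; `$punct_no_hyphen` and `is_punct_no_hyphen` are names invented for the example.

```perl
use v5.18;      # regex_sets first appeared (as experimental) in 5.18
use warnings;
no warnings 'experimental::regex_sets';

# Set difference: everything in [:punct:] minus the hyphen. The hyphen
# is backslash-escaped inside its own bracketed class so it cannot be
# read as a range or as the subtraction operator.
my $punct_no_hyphen = qr/\A (?[ [:punct:] - [\-] ]) \z/x;

# True if the single character is punctuation other than "-"
# (hypothetical helper).
sub is_punct_no_hyphen {
    my ($ch) = @_;
    return $ch =~ $punct_no_hyphen ? 1 : 0;
}

for my $ch ('!', ',', '-', 'a') {
    printf "%s: %s\n", $ch, is_punct_no_hyphen($ch) ? 'match' : 'no match';
}
```

If this works as I expect, the class could stand in for /(?!-) [[:punct:]]/ in the first alternative, with the lone-hyphen alternative kept as it is.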
In reply to Re: Problem with a text-parsing regex
by hv
in thread Problem with a text-parsing regex
by ibm1620