Re: Problem with a text-parsing regex

Here's one approach to solving the first problem: handling both "it's" and "will-o'-the-wisp":

        (                   # $2: a "word" consisting of one or more of
            (?:
                [[:word:]]  #   a word character
            |               #   or hyphen, quote, or both
                            #     with word characters before and after
                (?<= [[:word:]] )
                (?: ' | - | '- | -' )
                (?= [[:word:]] )
            )+
        )
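
For anyone who wants to poke at this in isolation, here is a minimal sketch that wraps the same alternation in a standalone `qr//x` and pulls the words out of a string. The sample text is made up for illustration and is not one of the original test cases, and the surrounding capture group is numbered differently than in the full regex.

```perl
use strict;
use warnings;

# The "word" pattern from above, compiled with /x so that layout and
# comments are ignored by the regex engine.
my $word = qr{
    (?:
        [[:word:]]                  # a word character
    |                               # or hyphen, quote, or both,
        (?<= [[:word:]] )           #   preceded by a word character
        (?: ' | - | '- | -' )
        (?= [[:word:]] )            #   and followed by one
    )+
}x;

# Hypothetical sample text, not the original poster's test data.
my $text  = "it's a will-o'-the-wisp -- 'quoted' text";
my @words = $text =~ m{ ($word) }xg;

print join('|', @words), "\n";
# prints: it's|a|will-o'-the-wisp|quoted|text
```

The lookbehind and lookahead are what keep the leading and trailing quotes of `'quoted'` out of the word: a quote or hyphen only counts when there is a word character on both sides of it.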

For the double-hyphen, the easy solution is to replace it with a space before parsing. The harder solution is to disallow it within the [[:punct:]]*, something like:

  # any punctuation excluding "-"
  # or "-" that is neither preceded nor followed by itself
  (?: (?!-) [[:punct:]] | (?<!-) - (?!-) )*
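
A quick way to see what that fragment accepts is to anchor it at both ends and feed it a few punctuation runs. This is only a sketch: the sample strings are invented for the demonstration, and in the real regex the fragment is not anchored like this.

```perl
use strict;
use warnings;

# Punctuation run that refuses to swallow any part of a double hyphen.
my $punct = qr{
    (?:
        (?!-) [[:punct:]]       # any punctuation other than a hyphen
    |
        (?<!-) - (?!-)          # or a hyphen with no hyphen on either side
    )*
}x;

# Hypothetical inputs: does the whole string count as a punctuation run?
my @samples = ('!?', '-', '."-', '--', '!--');
my %whole;
for my $s (@samples) {
    $whole{$s} = $s =~ m{ \A $punct \z }x ? 1 : 0;
    printf "%-6s %s\n", "'$s'", $whole{$s} ? 'match' : 'NO match';
}
```

A lone hyphen is still treated as ordinary punctuation; only runs containing `--` are rejected.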

With those two changes, I _think_ it passes all your test cases.

With a sufficiently recent perl, the experimental regex_sets feature should let you construct "any punctuation except hyphen" directly as a character class, which would be more efficient than /(?!-) [[:punct:]]/. I haven't yet worked out how to do that though - it's made harder by the special nature of '-' in character classes, doubly-special in char class arithmetic.
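
One possible spelling with regex_sets, offered as an untested-in-anger sketch rather than a worked-out answer: subtract a bracketed class holding an escaped hyphen from the punctuation class. It assumes perl 5.18 or later; the `no warnings` line is there because the feature still warned as experimental before 5.36.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';   # assumption: perl 5.18 or later

my @results;
for my $char (split //, '#%-&') {
    # Inside (?[ ... ]) a bare "-" is the set-subtraction operator, so the
    # literal hyphen is written as an escaped member of its own class.
    my $ok = $char =~ m{ \A (?[ [[:punct:]] - [\-] ]) \z }x;
    push @results, sprintf "'%s' %smatch", $char, $ok ? '' : 'NO ';
}
print "$_\n" for @results;
```

Whether this is actually faster than /(?!-) [[:punct:]]/ would need measuring on real input.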


Replies are listed 'Best First'.
Re^2: Problem with a text-parsing regex (updated) by AnomalousMonk (Archbishop) on May 07, 2022 at 22:46 UTC
... "any punctuation except hyphen" ...

This can be expressed without experimental features by a "double-negative" character class trick:

    class of all characters that are
    [^-[:^punct:]]
      ^ ^
      | |
      | +--- and also not a not-punct (i.e., is a [:punct:])
      |
      +--- not a hyphen

    Win8 Strawberry 5.8.9.5 (32) Sat 05/07/2022 18:36:51
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    for my $char (split '', '#%-&') {
        printf "'%s' %smatch \n",
            $char,
            $char =~ m{ \A [^-[:^punct:]] \z }xms ? '' : 'NO '
            ;
        }
    ^Z
    '#' match
    '%' match
    '-' NO match
    '&' match
    '' match

See perlrecharclass.

Update: The double-negative trick also works with "traditional" `\s \d \w` etc. character classes that have complements. E.g., the pattern "any word (`\w`) character except an underscore" can be defined as `[^_\W]`.

Give a man a fish: `<%-{-{-{-<`
Re^2: Problem with a text-parsing regex by ibm1620 (Chaplain) on May 07, 2022 at 21:50 UTC
Thank you -- I think you've nailed it. I'd never thought about using a character-at-a-time approach as you did to handle the first problem. I just assumed it would be much less efficient than trying to use `[[:word:]]+`, for example. But there's probably no basis for that assumption. (Premature optimization!) That could make it easier in the future for me to tackle these complicated scenarios. I use v5.34.1, and will take a look at regex_sets.
Re^3: Problem with a text-parsing regex by hv (Prior) on May 07, 2022 at 23:17 UTC
"I'd never thought about using a character-at-a-time approach as you did to handle the first problem. I just assumed it would be much less efficient than trying to use `[[:word:]]+`, for example."

It will be less efficient - but I would always recommend solving the problem first, and worrying about optimization second. In the general case, a regular expression that has to invoke more regops (regexp operations) will usually be slower than one that invokes fewer; but the cost of the extra regops will be less than the cost of invoking more ops at the perl level.