in reply to Problem with a text-parsing regex

Here's one approach to solving the first problem: handling both "it's" and "will-o'-the-wisp":

(                                   # $2: a "word" consisting of one or more of
    (?:
        [[:word:]]                  # a word character
    |                               # or hyphen, quote, or both
                                    # with word characters before and after
        (?<= [[:word:]] )
        (?: ' | - | '- | -' )
        (?= [[:word:]] )
    )+
)
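A minimal sketch of how that fragment might be used on its own to pull "words" out of a line. The names $word_rx and words_in, and the sample sentence, are invented for this example and are not part of the original test cases.

```perl
use strict;
use warnings;

# The "word" fragment from above, compiled on its own.
# ($word_rx is a name chosen for this sketch.)
my $word_rx = qr{
    (?:
        [[:word:]]                  # a word character
    |                               # or hyphen, quote, or both,
        (?<= [[:word:]] )           # with a word character before
        (?: ' | - | '- | -' )
        (?= [[:word:]] )            # and a word character after
    )+
}x;

# Return every "word" found in a string (hypothetical helper).
sub words_in {
    my ($text) = @_;
    return $text =~ /($word_rx)/g;
}

print join('|', words_in(q{it's a will-o'-the-wisp, 'quoted' -- done-})), "\n";
```

The lookarounds are what keep a leading quote, a trailing hyphen, or a bare "--" out of the captured word.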

For the double-hyphen, the easy solution is to replace it with a space before parsing. The harder solution is to disallow it within the [[:punct:]]* part, with something like:

# any punctuation excluding "-"
# or "-" that is neither preceded nor followed by itself
(?:
    (?!-) [[:punct:]]
|
    (?<!-) - (?!-)
)*
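A small sketch, under the assumption that the punctuation run is matched at the start of a token, showing how that fragment treats single and double hyphens. $punct_rx, leading_punct and the sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Punctuation run that may contain single hyphens but never a double hyphen.
# ($punct_rx is a name chosen for this sketch.)
my $punct_rx = qr{
    (?:
        (?!-) [[:punct:]]       # any punctuation other than a hyphen
    |
        (?<!-) - (?!-)          # or a hyphen with no hyphen on either side
    )*
}x;

# Return the leading punctuation run of a string, possibly empty
# (hypothetical helper).
sub leading_punct {
    my ($text) = @_;
    my ($run) = $text =~ /\A($punct_rx)/;
    return $run;
}

printf "%-8s => '%s'\n", $_, leading_punct($_)
    for q{("-word}, q{--word}, q{,--word};
```

In each case the run stops as soon as it would have to consume one half of a "--" pair.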

With those two changes, I _think_ it passes all your test cases.

With a sufficiently recent perl, the experimental regex_sets feature should let you construct "any punctuation except hyphen" directly as a character class, which would be more efficient than /(?!-) [[:punct:]]/. I haven't yet worked out how to do that though - it's made harder by the special nature of '-' in character classes, doubly-special in char class arithmetic.
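For what it's worth, here is one way this might look, going by my reading of the "Extended Bracketed Character Classes" section of perlrecharclass. Treat it as a sketch rather than a verified answer. Writing the hyphen as \x{2D} is just a way to sidestep its special meaning inside the set, and is_punct_no_hyphen is a made-up name.

```perl
use strict;
use warnings;
no warnings 'experimental::regex_sets';   # silences the warning on perls before 5.36

# "Any punctuation except hyphen" written as a set difference.
# \x{2D} is the hyphen; spelling it as an escape avoids any ambiguity
# with the subtraction operator.
my $punct_no_hyphen = qr/\A (?[ [:punct:] - [\x{2D}] ]) \z/x;

# True if the argument is a single punctuation character other than "-"
# (hypothetical helper).
sub is_punct_no_hyphen {
    my ($char) = @_;
    return $char =~ $punct_no_hyphen ? 1 : 0;
}

for my $char (split //, '#%-&*') {
    printf "'%s' %smatch\n", $char, is_punct_no_hyphen($char) ? '' : 'NO ';
}
```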

Re^2: Problem with a text-parsing regex (updated)
by AnomalousMonk (Archbishop) on May 07, 2022 at 22:46 UTC
    ... "any punctuation except hyphen" ...

    This can be expressed without experimental features by a "double-negative" character class trick:

    class of all characters that are
    [^-[:^punct:]]
      ^ ^
      | |
      | +--- and also not a not-punct (i.e., or is a [:punct:])
      |
      +--- not a hyphen
    Win8 Strawberry 5.8.9.5 (32) Sat 05/07/2022 18:36:51
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    for my $char (split '', '#%-&*') {
        printf "'%s' %smatch \n",
            $char,
            $char =~ m{ \A [^-[:^punct:]] \z }xms ? '' : 'NO '
            ;
        }
    ^Z
    '#' match
    '%' match
    '-' NO match
    '&' match
    '*' match
    See perlrecharclass.

    Update: The double-negative trick also works with "traditional" \s \d \w etc. character classes that have complements. E.g., the pattern "any word (\w) character except an underscore" can be defined as [^_\W].
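A quick check of the [^_\W] form, in the same style as the test above. The variable name and the sample characters are chosen here for illustration.

```perl
use strict;
use warnings;

# "Any word character except underscore":
# not an underscore, and not a non-word character.
my $word_no_underscore = qr/\A[^_\W]\z/;

for my $char ('a', '7', '_', '-') {
    printf "'%s' %smatch\n", $char, $char =~ $word_no_underscore ? '' : 'NO ';
}
```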


    Give a man a fish:  <%-{-{-{-<

Re^2: Problem with a text-parsing regex
by ibm1620 (Chaplain) on May 07, 2022 at 21:50 UTC
    Thank you -- I think you've nailed it.

    I'd never thought about using a character-at-a-time approach as you did to handle the first problem. I just assumed it would be much less efficient than trying to use [[:word:]]+, for example. But there's probably no basis for that assumption. (Premature optimization!) That could make it easier in the future for me to tackle these complicated scenarios.

    I use v5.34.1, and will take a look at regex_sets.

      I'd never thought about using a character-at-a-time approach as you did to handle the first problem. I just assumed it would be much less efficient than trying to use [[:word:]]+, for example.

      It will be less efficient - but I would always recommend solving the problem first, and worrying about optimization second.

      In the general case, a regular expression that has to invoke more regops (regexp operations) will usually be slower than one that invokes fewer; but the cost will be less than invoking more ops at the perl level.
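If you want to measure the difference rather than guess, something along these lines with the core Benchmark module should do. This is only a sketch: the sample text and the names are invented, and the numbers will depend on the perl version and the input.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Sample input invented for this sketch: 4000 plain words.
my $text = join ' ', ('alpha beta gamma delta') x 1000;

# One quantified character class.
my $run_rx = qr/[[:word:]]+/;

# The character-at-a-time alternation from earlier in the thread.
my $char_rx = qr{
    (?:
        [[:word:]]
    |
        (?<= [[:word:]] ) (?: ' | - | '- | -' ) (?= [[:word:]] )
    )+
}x;

# Count how many times a pattern matches in a string (hypothetical helper).
sub count_matches {
    my ($rx, $str) = @_;
    my $count = () = $str =~ /$rx/g;
    return $count;
}

# Run each variant 500 times and print a comparison table.
cmpthese(500, {
    run  => sub { count_matches($run_rx,  $text) },
    char => sub { count_matches($char_rx, $text) },
});
```

Both patterns find the same words in this input, so any difference in the table comes from the matching strategy.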