Problem with a text-parsing regex

ibm1620 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'm having difficulty with a regexp to split English text into the sort of elements I need.

Original plan was to chop up lines of text into whitespace-separated chunks, and separate out leading and trailing punctuation into separate variables, producing three values: $pre, $word, and $post. $post's final character would be the whitespace character separating it from the next chunk.

Several complications: I want to allow a "word" to be a hyphenated term (two-fer, Bob's-yer-uncle, will-o'-the-wisp); I want to allow embedded apostrophes (o'clock, it's); and I want to treat two or more hyphens in a row as equivalent to a whitespace character that separates the chunks.

The following almost works the way I want it to. I've noted where it fails. I can generally see what causes a failure, but fixing it always breaks something else.

As always, thanks for your generous help!

#!/usr/bin/env perl

use 5.010;
use warnings;
use strict;

my $n; # line no
while (my $x = <DATA>) {
    chomp $x;
    say $x;
    while (
        $x =~ m/
        ([[:punct:]]*)      # $1: leading punct marks
        (                   # $2: a "word" consisting of
            (?: [[:word:]']+ - )* # optional segments with
                                  # embedded {'}s ending with
                                  # single {-}
            [[:word:]]+     # and ending in pure word characters
        )
        ([[:punct:]]* \ ? )   # $3: trailing punct marks ending
                              # with space (except at end of
                              # line?)
        /xxg
    )
    {
        printf "  %3s {%s|%s|%s}\n", ++$n,
            # make whitespace visible
            map {(my $y = $_ // '') =~ tr/ /_/; $y}  $1, $2, $3;
    }
}
__DATA__
"'Uncouth' about sums it up."
The word they will use is 'uncouth'.
"It's the old story."
It's a will-o'-the-wisp--a two-fer--and Bob's-yer-uncle at four o'clock.
It's two o'clock--time for a nap.
Remember 45's?
What about (this)?
[Editor's note: blah blah] and so on...
A ... and B
I said--"What's the expression?"

Output:

"'Uncouth' about sums it up."
    1 {"'|Uncouth|'_}
    2 {|about|_}
    3 {|sums|_}
    4 {|it|_}
    5 {|up|."}
The word they will use is 'uncouth'.
    6 {|The|_}
    7 {|word|_}
    8 {|they|_}
    9 {|will|_}
   10 {|use|_}
   11 {|is|_}
   12 {'|uncouth|'.}
"It's the old story."
   13 {"|It|'}                    <- should be {"|It's|_}
   14 {|s|_}
   15 {|the|_}
   16 {|old|_}
   17 {|story|."}
It's a will-o'-the-wisp--a two-fer--and Bob's-yer-uncle at four o'clock.
   18 {|It|'}                              <- same problem
   19 {|s|_}
   20 {|a|_}
   21 {|will-o'-the-wisp|--}     <- perfect!
   22 {|a|_}
   23 {|two-fer|--}
   24 {|and|_}
   25 {|Bob's-yer-uncle|_}
   26 {|at|_}
   27 {|four|_}
   28 {|o|'}                            <- should be {|o'clock|.}
   29 {|clock|.}
It's two o'clock--time for a nap.
   30 {|It|'}
   31 {|s|_}
   32 {|two|_}
   33 {|o|'}                            <- should be {|o'clock|--}
   34 {|clock|--}
   35 {|time|_}
   36 {|for|_}
   37 {|a|_}
   38 {|nap|.}
Remember 45's?
   39 {|Remember|_}
   40 {|45|'}                                 <-
   41 {|s|?}
What about (this)?
   42 {|What|_}
   43 {|about|_}
   44 {(|this|)?}
[Editor's note: blah blah] and so on...
   45 {[|Editor|'}                           <-
   46 {|s|_}
   47 {|note|:_}
   48 {|blah|_}
   49 {|blah|]_}
   50 {|and|_}
   51 {|so|_}
   52 {|on|...}
A ... and B
   53 {|A|_}           <- correct to omit detached ellipsis
   54 {|and|_}
   55 {|B|}
I said--"What's the expression?"
   56 {|I|_}
   57 {|said|--"}             <- should be {|said|--}
   58 {|What|'}               <- should be {"|What's|_}
   59 {|s|_}
   60 {|the|_}
   61 {|expression|?"}

Replies are listed 'Best First'.
Re: Problem with a text-parsing regex by hv (Prior) on May 07, 2022 at 20:13 UTC
Here's one approach to solving the first problem: handling both "it's" and "will-o'-the-wisp":

    (                       # $2: a "word" consisting of one or more of
        (?:
            [[:word:]]      # a word character
        |
            # or hyphen, quote, or both
            # with word characters before and after
            (?<= [[:word:]] )
            (?: ' | - | '- | -' )
            (?= [[:word:]] )
        )+
    )

For the double-hyphen, the easy solution is to replace it with space before parsing. The harder solution is to disallow it within the `[[:punct:]]`, something like:

    # any punctuation excluding "-"
    # or "-" that is neither preceded nor followed by itself
    (?: (?!-) [[:punct:]] | (?<!-) - (?!-) )

With those two changes, I _think_ it passes all your test cases.

With a sufficiently recent perl, the experimental regex_sets feature should let you construct "any punctuation except hyphen" directly as a character class, which would be more efficient than `/(?!-) [[:punct:]]/`. I haven't yet worked out how to do that though - it's made harder by the special nature of '-' in character classes, doubly-special in char class arithmetic.
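
For readers who want to try hv's two fragments together, here is a minimal sketch that drops them into the three-capture pattern from the original question. The helper name `tokenize` and the sample sentence are illustrative assumptions, not part of hv's reply, and the combination has only been reasoned through against a handful of the test sentences.

```perl
use strict;
use warnings;

# Punctuation other than "-", or a lone "-" with no "-" on either side.
my $punct = qr/ (?: (?!-) [[:punct:]] | (?<!-) - (?!-) ) /x;

# One or more word characters, allowing ', -, '- or -' between them.
my $word = qr/
    (?:
        [[:word:]]
    |
        (?<= [[:word:]] ) (?: ' | - | '- | -' ) (?= [[:word:]] )
    )+
/x;

# Hypothetical helper: returns a list of [pre, word, post] triples.
sub tokenize {
    my ($line) = @_;
    my @tokens;
    while ( $line =~ m/ ($punct*) ($word) ($punct* \ ?) /xg ) {
        push @tokens, [ $1, $2, $3 ];
    }
    return @tokens;
}

# Print each triple with spaces made visible, as in the original script.
for my $t ( tokenize(q{I said--"What's the expression?"}) ) {
    print '{', join( '|', map { tr/ /_/r } @$t ), "}\n";
}
```

Note that with this combination a `--` is skipped entirely rather than captured, so `said--` comes out as `{|said|}`. That follows the "treat it like whitespace" reading of the requirements; whether the separator should instead appear in `$3` is a separate design decision.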
Re^2: Problem with a text-parsing regex (updated) by AnomalousMonk (Archbishop) on May 07, 2022 at 22:46 UTC
... "any punctuation except hyphen" ...

This can be expressed without experimental features by a "double-negative" character class trick:

    class of all characters that are
    [^-[:^punct:]]
      ^ ^
      | |
      | +--- and also not a not-punct (i.e., or is a [:punct:])
      |
      +--- not a hyphen

    Win8 Strawberry 5.8.9.5 (32) Sat 05/07/2022 18:36:51
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    for my $char (split '', '#%-&') {
        printf "'%s' %smatch \n",
            $char,
            $char =~ m{ \A [^-[:^punct:]] \z }xms ? '' : 'NO '
            ;
        }
    ^Z
    '#' match
    '%' match
    '-' NO match
    '&' match
    '' match

See perlrecharclass.

Update: The double-negative trick also works with "traditional" `\s \d \w` etc. character classes that have complements. E.g., the pattern "any word (`\w`) character except an underscore" can be defined as `[^_\W]`.

Give a man a fish: `<%-{-{-{-<`
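
The `[^_\W]` variant mentioned in the update can be tried the same way. This is a small sketch only; the helper name `word_chars_no_underscore` and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the characters that match [^_\W],
# i.e. "not an underscore and not a non-word character".
sub word_chars_no_underscore {
    my ($text) = @_;
    return join '', grep { /\A [^_\W] \z/x } split //, $text;
}

print word_chars_no_underscore('snake_case-42!'), "\n";    # prints "snakecase42"
```

The same shape works for any class that has a built-in complement: negate the complement inside a negated bracket class, and list the extra characters to exclude next to it.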
Re^2: Problem with a text-parsing regex by ibm1620 (Hermit) on May 07, 2022 at 21:50 UTC
Thank you -- I think you've nailed it. I'd never thought about using a character-at-a-time approach as you did to handle the first problem. I just assumed it would be much less efficient than trying to use `[[:word:]]+`, for example. But there's probably no basis for that assumption. (Premature optimization!) That could make it easier in the future for me to tackle these complicated scenarios. I use v5.34.1, and will take a look at regex_sets.
Re^3: Problem with a text-parsing regex by hv (Prior) on May 07, 2022 at 23:17 UTC
"I'd never thought about using a character-at-a-time approach as you did to handle the first problem. I just assumed it would be much less efficient than trying to use `[[:word:]]+`, for example."

It will be less efficient - but I would always recommend solving the problem first, and worrying about optimization second. In the general case, a regular expression that has to invoke more regops (regexp operations) will usually be slower than one that invokes fewer; but the cost will be less than invoking more ops at the perl level.
Re: Problem with a text-parsing regex by tybalt89 (Monsignor) on May 08, 2022 at 18:12 UTC
Different way to handle --

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11143647
    use warnings;

    while (my $x = <DATA>) {
        chomp $x;
        print "\n$x\n";
        my $out = '';
        while (
            $x =~ m/
            ([[:punct:]]*)      # $1: leading punct marks
            (                   # $2: a "word" consisting of
                [[:word:]]+     # word
                (?:
                    (?: '-? | - )
                    [[:word:]]+     # and ending in pure word characters
                )*
            )
            ( (?: --+ | [[:punct:]]* ) \ ? )   # $3: trailing punct marks ending
                                               # or multi-dashes
                                               # with space (except at end of
                                               # line?)
            /xxg
        )
        {
            $out .= sprintf "{%s|%s|%s} ",
                # make whitespace visible
                map {(my $y = $_ // '') =~ tr/ /_/; $y}  $1, $2, $3;
        }
        print "$out\n" =~ s/ $//r =~ s/.{65}\K /\n/gr;
    }
    __DATA__
    "'Uncouth' about sums it up."
    The word they will use is 'uncouth'.
    "It's the old story."
    It's a will-o'-the-wisp--a two-fer--and Bob's-yer-uncle at four o'clock.
    It's two o'clock--time for a nap.
    Remember 45's?
    What about (this)?
    [Editor's note: blah blah] and so on...
    A ... and B
    I said--"What's the expression?"

Outputs (changed to be able to see all output without scrolling):

    "'Uncouth' about sums it up."
    {"'|Uncouth|'_} {|about|_} {|sums|_} {|it|_} {|up|."}

    The word they will use is 'uncouth'.
    {|The|_} {|word|_} {|they|_} {|will|_} {|use|_} {|is|_} {'|uncouth|'.}

    "It's the old story."
    {"|It's|_} {|the|_} {|old|_} {|story|."}

    It's a will-o'-the-wisp--a two-fer--and Bob's-yer-uncle at four o'clock.
    {|It's|_} {|a|_} {|will-o'-the-wisp|--} {|a|_} {|two-fer|--} {|and|_}
    {|Bob's-yer-uncle|_} {|at|_} {|four|_} {|o'clock|.}

    It's two o'clock--time for a nap.
    {|It's|_} {|two|_} {|o'clock|--} {|time|_} {|for|_} {|a|_} {|nap|.}

    Remember 45's?
    {|Remember|_} {|45's|?}

    What about (this)?
    {|What|_} {|about|_} {(|this|)?}

    [Editor's note: blah blah] and so on...
    {[|Editor's|_} {|note|:_} {|blah|_} {|blah|]_} {|and|_} {|so|_} {|on|...}

    A ... and B
    {|A|_} {|and|_} {|B|}

    I said--"What's the expression?"
    {|I|_} {|said|--} {"|What's|_} {|the|_} {|expression|?"}

Did I get everything right?
Re^2: Problem with a text-parsing regex by hv (Prior) on May 09, 2022 at 01:22 UTC
The instructions were that "--" was to be treated like a space, so presumably should not be part of the punctuation runs - I think it should print `{|o'clock|}`, for example, not `{|o'clock|--}`.

I do prefer your version of the word parsing to mine, but I suspect `(?: '-? | -'? )` is what's intended. (There aren't any examples of `"word-'word"` in the test cases though - I could probably come up with one in Dutch, but I imagine they're pretty rare in English.)
Re^3: Problem with a text-parsing regex by ibm1620 (Hermit) on May 09, 2022 at 13:16 UTC
I grepped my collection of text files (all English-language downloads from gutenberg.org) for `-'` and only found forty-'leven and fellow-'prentice. I've updated tybalt89's solution with your improvement.

The contents of `$3` will contain a final space if one is present, so `{|o'clock|--}` is consistent with the instructions.
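
The updated solution itself is not shown in the thread. As a rough sketch of what it might look like, the following applies hv's `(?: '-? | -'? )` joiner to tybalt89's pattern; the helper name `split_chunks` and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper built on tybalt89's pattern, with hv's suggested
# joiner (?: '-? | -'? ) so that "-'" is also accepted inside a word.
sub split_chunks {
    my ($line) = @_;
    my @chunks;
    while (
        $line =~ m/
            ([[:punct:]]*)                          # $1: leading punctuation
            (                                       # $2: the word
                [[:word:]]+
                (?: (?: '-? | -'? ) [[:word:]]+ )*
            )
            ( (?: --+ | [[:punct:]]* ) \ ? )        # $3: trailing punctuation
        /xg
    ) {
        push @chunks, [ $1, $2, $3 ];
    }
    return @chunks;
}

# Show the triples for a line containing both "-'" and "--".
print join( ' ', map { "{$_->[0]|$_->[1]|$_->[2]}" }
        split_chunks("forty-'leven o'clock--time") ), "\n";
```

Here `forty-'leven` stays in one piece, while the `--` after `o'clock` still ends the word and lands in the third capture.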
Re^2: Problem with a text-parsing regex by ibm1620 (Hermit) on May 08, 2022 at 23:44 UTC
Yes! Beautiful. Makes perfect sense. Thank you. Your final `print` is a real head-scratcher, but I like what it does and I'll noodle on it some more....
Re^3: Problem with a text-parsing regex by kcott (Archbishop) on May 11, 2022 at 08:23 UTC
G'day ibm1620,

"Your final `print` is a real head-scratcher, but I like what it does and I'll noodle on it some more...."

I'm assuming "final `print`" refers to:

    print "$out\n" =~ s/ $//r =~ s/.{65}\K /\n/gr;

I don't know which bit, or bits, you're having difficulties with. One, or both, of these references might help:

perlop: Regexp Quote-Like Operators: s/PATTERN/REPLACEMENT/msixpodualngcer
    The `/r` option indicates non-destructive substitution. These can be chained; e.g. `s///r =~ s///r =~ s///r`
    The same option also exists for transliteration: "perlop: Quote-Like Operators: y/SEARCHLIST/REPLACEMENTLIST/cdsr". These can also be chained; e.g. `y///r =~ y///r =~ y///r`
    You can also mix chaining; e.g. `s///r =~ y///r =~ s///r =~ y///r`
    Introduced in `5.14`: "perl5140delta: Non-destructive substitution".

perlrebackslash: Misc: \K
    The `\K` escape sequence indicates that everything to its left should be kept. See also "perlre: Extended Patterns: Lookaround Assertions: \K".
    Introduced in `5.10`: "perl5100delta: Core Enhancements: Regular expressions: \K escape".

I've shown the versions where `/r` and `\K` were introduced. With your stated `5.34.1` version, both of these will be available to you. I included this information for those using older versions of Perl.

-- Ken
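
To see the two pieces Ken describes working separately, the one-liner can be unrolled into two named steps. This is an illustrative sketch: the helper `trim_and_wrap` and its `$width` parameter are not in tybalt89's code, which hard-codes 65.

```perl
use 5.014;    # s///r needs 5.14; \K needs 5.10
use strict;
use warnings;

# Hypothetical helper: the same two chained substitutions as the one-liner,
# with the width made a parameter.
sub trim_and_wrap {
    my ( $text, $width ) = @_;

    # Step 1: /r returns a modified copy; here a single trailing space is removed.
    my $trimmed = $text =~ s/ $//r;

    # Step 2: match $width characters, keep them via \K, then replace only
    # the following space with a newline. /g repeats this along the string.
    return $trimmed =~ s/.{$width}\K /\n/gr;
}

print trim_and_wrap( 'aaa bbb ccc ddd ', 5 ), "\n";
```

Because `.{$width}` must be followed by a space, the break lands on the first space at or beyond the requested width, so lines can run a little longer than `$width` but words are never split.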
Re: Problem with a text-parsing regex by Fletch (Bishop) on May 07, 2022 at 20:05 UTC
Rather than trying to roll your own regex (and depending on what you're trying to do with this next) you probably want to look at CPAN and search for NLP modules (Natural Language Processing) instead. Those are likely going to do what you want WRT removing not-words as well as being able to give you more info about the words/tokens it extracts. The cake is a lie. The cake is a lie. The cake is a lie.
Re^2: Problem with a text-parsing regex by ibm1620 (Hermit) on May 07, 2022 at 21:00 UTC
Oddly enough, I found almost nothing of use in CPAN related to NLP! That's very surprising. (See, for example, https://metacpan.org/pod/Text::NLP) In any event, there's more idiosyncratic processing of the words and surrounding punctuation, so it's doubtful that any CPAN module would exactly fit my needs.
Re^3: Problem with a text-parsing regex by tangent (Parson) on May 07, 2022 at 23:44 UTC
The NLP modules are generally found in the "Lingua" namespace - most of the various part-of-speech taggers, lemmatizers, stemmers etc. have a Perl implementation. Some examples: Lingua::EN::Tagger, Lingua::TreeTagger, Lingua::FreeLing3, Lingua::EN::SENNA, Lingua::CollinsParser, Lingua::BrillTagger. If you are going to do further NLP type analysis they are worth looking into.
Re: Problem with a text-parsing regex by AnomalousMonk (Archbishop) on May 07, 2022 at 22:54 UTC
I would have thought your first step would have been to write a unit test (see Test::More and friends) specifying exactly what you want to parse from things like "'Uncouth' about sums it up.", "It's the old story.", etc.
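
A unit test along those lines might start out like the sketch below. It uses the core Test::More module; the `split_chunks` helper and its pattern are adapted from replies in this thread and are assumptions here, and the expected triples are taken from the results the original poster asked for.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical function under test: returns [pre, word, post] triples.
sub split_chunks {
    my ($line) = @_;
    my @chunks;
    while (
        $line =~ m/
            ([[:punct:]]*)
            ( [[:word:]]+ (?: (?: '-? | -'? ) [[:word:]]+ )* )
            ( (?: --+ | [[:punct:]]* ) \ ? )
        /xg
    ) {
        push @chunks, [ $1, $2, $3 ];
    }
    return @chunks;
}

# Each input line maps to the exact triples it should produce.
my %expected = (
    q{"It's the old story."} => [
        [ '"', "It's",  ' ' ],
        [ '',  'the',   ' ' ],
        [ '',  'old',   ' ' ],
        [ '',  'story', '."' ],
    ],
    q{Remember 45's?} => [
        [ '', 'Remember', ' ' ],
        [ '', "45's",     '?' ],
    ],
);

for my $line ( sort keys %expected ) {
    is_deeply( [ split_chunks($line) ], $expected{$line}, "parsed: $line" );
}

done_testing();
```

Writing the expected triples first makes it obvious when a change that fixes one sentence breaks another, which is the situation described in the original question.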