Dear Monks,

I'm having difficulty with a regexp to split English text into the sort of elements I need.

My original plan was to chop lines of text into whitespace-separated chunks and split the leading and trailing punctuation off each one, producing three values: $pre, $word, and $post. The final character of $post would be the whitespace character separating the chunk from the next one.

Several complications: I want to allow a "word" to be a hyphenated term (two-fer, Bob's-yer-uncle, will-o'-the-wisp); I want to allow embedded apostrophes (o'clock, it's); and I want to treat two or more hyphens in a row as equivalent to a whitespace character that separates the chunks.

The following almost works the way I want it to. I've noted where it fails. I can generally see what causes a failure, but fixing it always breaks something else.

As always, thanks for your generous help!

#!/usr/bin/env perl
use 5.010;
use warnings;
use strict;

my $n;    # line no
while (my $x = <DATA>) {
    chomp $x;
    say $x;
    while (
        $x =~ m/
            ([[:punct:]]*)              # $1: leading punct marks
            (                           # $2: a "word" consisting of
                (?: [[:word:]']+ - )*   #     optional segments with
                                        #     embedded {'}s ending with
                                        #     single {-}
                [[:word:]]+             #     and ending in pure word characters
            )
            ([[:punct:]]* \ ? )         # $3: trailing punct marks ending
                                        #     with space (except at end of
                                        #     line?)
        /xxg
    ) {
        printf " %3s {%s|%s|%s}\n", ++$n,
            # make whitespace visible
            map {(my $y = $_ // '') =~ tr/ /_/; $y} $1, $2, $3;
    }
}
__DATA__
"'Uncouth' about sums it up."
The word they will use is 'uncouth'.
"It's the old story."
It's a will-o'-the-wisp--a two-fer--and Bob's-yer-uncle at four o'clock.
It's two o'clock--time for a nap.
Remember 45's?
What about (this)?
[Editor's note: blah blah] and so on...
A ... and B
I said--"What's the expression?"

Output:

"'Uncouth' about sums it up."
   1 {"'|Uncouth|'_}
   2 {|about|_}
   3 {|sums|_}
   4 {|it|_}
   5 {|up|."}
The word they will use is 'uncouth'.
   6 {|The|_}
   7 {|word|_}
   8 {|they|_}
   9 {|will|_}
  10 {|use|_}
  11 {|is|_}
  12 {'|uncouth|'.}
"It's the old story."
  13 {"|It|'}    <- should be {"|It's|_}
  14 {|s|_}
  15 {|the|_}
  16 {|old|_}
  17 {|story|."}
It's a will-o'-the-wisp--a two-fer--and Bob's-yer-uncle at four o'clock.
  18 {|It|'}    <- same problem
  19 {|s|_}
  20 {|a|_}
  21 {|will-o'-the-wisp|--}    <- perfect!
  22 {|a|_}
  23 {|two-fer|--}
  24 {|and|_}
  25 {|Bob's-yer-uncle|_}
  26 {|at|_}
  27 {|four|_}
  28 {|o|'}    <- should be {|o'clock|.}
  29 {|clock|.}
It's two o'clock--time for a nap.
  30 {|It|'}
  31 {|s|_}
  32 {|two|_}
  33 {|o|'}    <- should be {|o'clock|--}
  34 {|clock|--}
  35 {|time|_}
  36 {|for|_}
  37 {|a|_}
  38 {|nap|.}
Remember 45's?
  39 {|Remember|_}
  40 {|45|'}    <-
  41 {|s|?}
What about (this)?
  42 {|What|_}
  43 {|about|_}
  44 {(|this|)?}
[Editor's note: blah blah] and so on...
  45 {[|Editor|'}    <-
  46 {|s|_}
  47 {|note|:_}
  48 {|blah|_}
  49 {|blah|]_}
  50 {|and|_}
  51 {|so|_}
  52 {|on|...}
A ... and B
  53 {|A|_}    <- correct to omit detached ellipsis
  54 {|and|_}
  55 {|B|}
I said--"What's the expression?"
  56 {|I|_}
  57 {|said|--"}    <- should be {|said|--}
  58 {|What|'}    <- should be {"|What's|_}
  59 {|s|_}
  60 {|the|_}
  61 {|expression|?"}

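For comparison, here is one way the marked failures might be approached. It is a sketch under assumptions, not a vetted fix. The names $segment, $chunk and split_chunks are made up for illustration. It assumes that an apostrophe belongs to a word only when it sits between word characters or directly before a joining hyphen (as in will-o'-the-wisp), and that a run of two or more hyphens ends the trailing punctuation, so that a following quote mark becomes the next chunk's leading punctuation.

```perl
#!/usr/bin/env perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper pattern: a run of word characters that may
# contain embedded apostrophes (It's, o'clock, 45's).
my $segment = qr/ [[:word:]]+ (?: ' [[:word:]]+ )* /x;

my $chunk = qr/
    ( [[:punct:]]* )                  # group 1: leading punctuation
    (                                 # group 2: the word
        $segment
        (?: '? - (?!-) $segment )*    # further segments, each joined by one hyphen
    )
    (                                 # group 3: trailing punctuation
        (?: (?!--) [[:punct:]] )*     # punctuation, stopping before a double hyphen
        (?: -{2,} | \x20? )           # then a hyphen run, or an optional space
    )
/x;

# Return a list of [pre, word, post] array refs for one line of text.
sub split_chunks {
    my ($line) = @_;
    my @chunks;
    while ( $line =~ /$chunk/g ) {
        push @chunks, [ $1, $2, $3 ];
    }
    return @chunks;
}

my @samples = (
    q{"It's the old story."},
    q{It's two o'clock--time for a nap.},
    q{I said--"What's the expression?"},
);

my $n = 0;
for my $line (@samples) {
    say $line;
    for my $c ( split_chunks($line) ) {
        # make whitespace visible, as in the original script
        ( my $shown = join '|', @$c ) =~ tr/ /_/;
        printf " %3d {%s}\n", ++$n, $shown;
    }
}
```

On the last sample line this gives {|said|--} followed by {"|What's|_}. The negative lookahead (?!-) after the joining hyphen is what keeps a double hyphen out of the word, and the (?!--) guard is what stops the trailing punctuation before the hyphen run. Text with other conventions (a leading apostrophe as in 'tis, or Unicode dashes and curly quotes) is not handled by this sketch.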
In reply to Problem with a text-parsing regex by ibm1620
