If you can come up with a list of all words that appear in compound words (not, I think, an easy job), here's one approach: build a regex alternation from the element words and match any run of two or more of them. I couldn't think of a triple- or quadruple-word compound to test with. Also, ambiguities in compounding, as exemplified in ww's reply, are not addressed by this approach.

>perl -wMstrict -le
"my @elements = qw(stop sign light painter porch store front);
 my ($elem) = map qr{ $_ }xms, join q{ | }, @elements ;
 my $plural = qr{ e? s }xms;
 my $compound = qr{ \b $elem{2,} (?= $plural? \b) }xms;
 ;;
 my $str = qq{a signpainter wanted the stopsigns in front \n}
         . qq{of all his storefronts` frontporches replaced \n}
         . qq{by stoplights to reduce accidents in front \n}
         . qq{of his stores}
         ;
 print qq{[$str]};
 ;;
 my @compounds = $str =~ m{ $compound }xmsg;
 printf qq{'$_' } for @compounds;
"
[a signpainter wanted the stopsigns in front
of all his storefronts` frontporches replaced
by stoplights to reduce accidents in front
of his stores]
'signpainter' 'stopsign' 'storefront' 'frontporch' 'stoplight'

Update: There must surely be (tens of?) thousands of words that are parts of compounds. The regex alternation formed from all these words will be very long. Even so, my guess is that the regex will compile and execute. OTOH, speed of execution is another matter.
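For a long word list, the alternation would probably be built programmatically rather than typed in. The following is only a sketch of how that might look: the name compound_regex is made up for illustration, the seven-word list stands in for a real dictionary, and the longest-first sort and quotemeta are my own precautions, not part of the one-liner above. As far as I know, Perl 5.10 and later compile an alternation of plain literal strings into a trie, which may take some of the sting out of the speed worry, but I have not benchmarked it.

```perl
use strict;
use warnings;

# Build the compound-matching regex from an arbitrary list of element words.
# (compound_regex is a hypothetical helper name, not from the original post.)
sub compound_regex {
    my @elements = @_;

    # Longest words first, so a longer element is tried before a shorter
    # one that is its prefix; quotemeta guards against metacharacters.
    my $alternation = join '|',
        map  { quotemeta }
        sort { length $b <=> length $a or $a cmp $b }
        @elements;

    my $elem   = qr{ (?: $alternation ) }xms;
    my $plural = qr{ e? s }xms;

    # Two or more elements in a row, optionally followed by a plural ending.
    return qr{ \b (?: $elem ){2,} (?= $plural? \b ) }xms;
}

my $compound = compound_regex(qw(stop sign light painter porch store front));
my $str      = 'a signpainter wanted the stopsigns in front of his stores';
my @found    = $str =~ m{ ($compound) }xmsg;
print "'$_'\n" for @found;
```

With a real dictionary the word list would come from a file; the regex still only needs to be compiled once and can then be reused on any amount of text.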

Update: Correction: Regex  qr{ $_ \b }xms in original example code should have been  qr{ $_ }xms – fixed.

Update: Improved example code to recognize plurals – at least in English.
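The example above only finds the compounds; it does not break them into their parts. Here is a minimal sketch of one way that step might be done, assuming the same element list. The name split_compound is hypothetical, and the split is a simple greedy left-to-right walk, so it does nothing about the ambiguity problem: where a word can be divided in more than one way, it either takes the first division it finds or gives up.

```perl
use strict;
use warnings;

# Split a compound into its element words. Returns an empty list when the
# word is not two or more elements followed by an optional plural ending.
# (split_compound is a hypothetical helper name, not from the original post.)
sub split_compound {
    my ($word, @elements) = @_;

    my $alternation = join '|',
        map  { quotemeta }
        sort { length $b <=> length $a }
        @elements;
    my $elem = qr{ (?: $alternation ) }xms;

    # Walk the word from the left, taking one element at each step.
    my @parts = $word =~ m{ \G ( $elem ) }xmsg;

    # Whatever the walk did not consume must be empty or a plural ending.
    my $rest = substr( $word, length( join q{}, @parts ) );

    return unless @parts >= 2 and $rest =~ m{ \A (?: e? s )? \z }xms;
    return @parts;
}

my @elements = qw(stop sign light painter porch store front);
for my $word (qw(stopsigns frontporches stores)) {
    my @parts = split_compound($word, @elements);
    print "$word: ", (@parts ? join(' + ', @parts) : '(not a compound)'), "\n";
}
```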


In reply to Re: Splitting compound (concatenated) words ) by AnomalousMonk
in thread Splitting compound (concatenated) words ) by vit
