In reply to Splitting compound (concatenated) words
If you can come up with a list of all words that appear in compound words (not, I think, an easy job), here's one approach. I couldn't think of a triple- or quadruple-word compound to test with. Also, ambiguities in compounding, as exemplified in the reply of ww, are not addressed by this approach.
    >perl -wMstrict -le
    "my @elements = qw(stop sign light painter porch store front);
     my ($elem) = map qr{ $_ }xms, join q{ | }, @elements ;
     my $plural = qr{ e? s }xms;
     my $compound = qr{ \b $elem{2,} (?= $plural? \b) }xms;
     ;;
     my $str = qq{a signpainter wanted the stopsigns in front \n}
             . qq{of all his storefronts` frontporches replaced \n}
             . qq{by stoplights to reduce accidents in front \n}
             . qq{of his stores}
             ;
     print qq{[$str]};
     ;;
     my @compounds = $str =~ m{ $compound }xmsg;
     printf qq{'$_' } for @compounds;
    "
    [a signpainter wanted the stopsigns in front
    of all his storefronts` frontporches replaced
    by stoplights to reduce accidents in front
    of his stores]
    'signpainter' 'stopsign' 'storefront' 'frontporch' 'stoplight'
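The one-liner finds the compounds but does not actually split them. A minimal sketch of that extra step follows, as a script rather than a one-liner. The helper name split_compound is made up for illustration, and the split simply takes the first alternative that matches at each position, so it does nothing about the ambiguous cases mentioned above and may stop short of the end of a word if the element list has overlapping entries.

```perl
use strict;
use warnings;

my @elements    = qw(stop sign light painter porch store front);
my $alternation = join '|', map { quotemeta } @elements;
my $elem        = qr{ (?: $alternation ) }xms;
my $plural      = qr{ e? s }xms;
my $compound    = qr{ \b $elem{2,} (?= $plural? \b) }xms;

# Hypothetical helper: split one already-matched compound into elements.
# \G anchors each match where the previous one ended, so no characters
# are skipped between elements.
sub split_compound {
    my ($word) = @_;
    return $word =~ m{ \G ($elem) }xmsg;
}

my $str = 'a signpainter wanted the stopsigns by his frontporches replaced';
for my $word ($str =~ m{ $compound }xmsg) {
    print "$word => ", join(' + ', split_compound($word)), "\n";
}
```

With the element list above this should print sign + painter, stop + sign and front + porch for the three compounds in the test string.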
Update: There must surely be (tens of?) thousands of words that are parts of compounds. The regex alternation formed from all these words will be very long. Even so, my guess is that the regex will compile and execute. OTOH, speed of execution is another matter.
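To test that guess without a real word list, here is a rough sketch that pads the alternation with synthetic filler "words" and times compilation and matching. The filler strings (aaaa, aaab, ...) and the count of 20,000 are arbitrary assumptions standing in for a real dictionary, and the timings will depend on your perl and machine, so none are quoted here.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Real elements plus synthetic filler to inflate the alternation.
# The filler is an assumption for testing only, not a real word list.
my @elements = qw(stop sign light painter porch store front);
my $filler   = 'aaaa';
push @elements, $filler++ for 1 .. 20_000;   # magic string increment

# quotemeta guards against regex metacharacters in a real-world list.
my $alternation = join '|', map { quotemeta } @elements;

my $t0           = time;
my $elem         = qr{ (?: $alternation ) }xms;
my $compound     = qr{ \b $elem{2,} (?= (?: e? s )? \b ) }xms;
my $compile_secs = time - $t0;

my $str = 'the stopsigns by his storefronts were replaced by stoplights';
$t0 = time;
my @compounds  = $str =~ m{ $compound }xmsg;
my $match_secs = time - $t0;

printf "%d elements, compile %.3fs, match %.3fs\n",
    scalar @elements, $compile_secs, $match_secs;
print "'$_' " for @compounds;
print "\n";
```

As far as I know, perl 5.10 and later turn an alternation of plain literal strings into a trie internally, which should help a lot here, but measure with your own list before relying on that.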
Update: Correction: Regex qr{ $_ \b }xms in original example code should have been qr{ $_ }xms – fixed.
Update: Improved example code to recognize plurals – at least in English.