in reply to Splitting compound (concatenated) words

If you can come up with a list of all words that appear in compound words (not, I think, an easy job), here's one approach. I couldn't think of a triple- or quadruple-word compound to test with. Also, ambiguities in compounding, as exemplified in the reply of ww, are not addressed by this approach.

>perl -wMstrict -le
"my @elements = qw(stop sign light painter porch store front);
 my ($elem) = map qr{ $_ }xms, join q{ | }, @elements ;
 my $plural = qr{ e? s }xms;
 my $compound = qr{ \b $elem{2,} (?= $plural? \b) }xms;
 ;;
 my $str = qq{a signpainter wanted the stopsigns in front \n}
         . qq{of all his storefronts` frontporches replaced \n}
         . qq{by stoplights to reduce accidents in front \n}
         . qq{of his stores}
         ;
 print qq{[$str]};
 ;;
 my @compounds = $str =~ m{ $compound }xmsg;
 printf qq{'$_' } for @compounds;
 "
[a signpainter wanted the stopsigns in front
of all his storefronts` frontporches replaced
by stoplights to reduce accidents in front
of his stores]
'signpainter' 'stopsign' 'storefront' 'frontporch' 'stoplight'

Update: There must surely be (tens of?) thousands of words that are parts of compounds. The regex alternation formed from all these words will be very long. Even so, my guess is that the regex will compile and execute. OTOH, speed of execution is another matter.
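
For a longer word list, the alternation can be built programmatically rather than typed by hand. The following is a minimal sketch, not a tuned implementation: the sub name compound_regex and the sample sentence are made up for illustration, and the word list is the small one from the example above standing in for a real dictionary. Perl 5.10 and later can merge alternations of literal strings into an internal trie, which may help with matching speed, but how much it helps for a list of this size is something to measure rather than assume.

```perl
use strict;
use warnings;

# Build the compound-matching regex from a list of element words.
# (compound_regex is a hypothetical helper name, not from the original post.)
sub compound_regex {
    my @elements = @_;

    # Longest first, so a longer element is tried before a shorter one
    # that is its prefix; quotemeta protects any regex metacharacters.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) or $a cmp $b } @elements;

    my $elem   = qr{ (?: $alternation ) }xms;
    my $plural = qr{ e? s }xms;

    # Two or more elements in a row, optionally followed by a plural ending.
    return qr{ \b (?: $elem ){2,} (?= $plural? \b ) }xms;
}

# Stand-in word list; a real one would be read from a dictionary file.
my $compound = compound_regex(qw(stop sign light painter porch store front));

my $text  = 'a signpainter wanted the stopsigns replaced by stoplights';
my @found = $text =~ m{ $compound }xmsg;
print "'$_'\n" for @found;
```

Wrapping the element group as (?: $elem ){2,} gives the same match as $elem{2,} in the one-liner, but does not rely on Perl guessing that the braces are a quantifier rather than a hash subscript.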

Update: Correction: the regex qr{ $_ \b }xms in the original example code should have been qr{ $_ }xms; fixed.

Update: Improved example code to recognize plurals – at least in English.

Re^2: Splitting compound (concatenated) words
by bitingduck (Deacon) on May 16, 2012 at 04:20 UTC

    Trying to do it with a regex could be pretty time consuming if the dictionary or the subject text got very long. I worked on a project in a natural language translation class way longer ago than I care to think about, and the approach to making the dictionary was to build a linked tree where each letter was a node and the letters that could follow it were its child nodes. At the end of each complete word you put a flag node that says "end of word" (EOW), but for a true compound word you'd have both a child node with the next letter and another child with the EOW flag.

    Kind of like this (where "." is end of word)

                  T
                /   \
              H       O
             / \     / \
            E   I   .   N
           / \   \       \
          N   .  etc.     .

    This dictionary includes "To", "Ton", "The", and "Then", and starts to spell out "This". It makes finding the combined words fast and straightforward, but it doesn't help with distinguishing true compound words (e.g. "bookkeeper") from things like "theme", which could be "the me" (updated here to correct my bad choice of example). If you're really clever you might use some sort of Markov chain tool to guess that.

    But I don't know the NLP modules well enough to know if there's something kicking around in CPAN. If you have a dictionary to slurp, you could do it yourself fairly easily.
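
    The letter tree described above can be sketched in Perl with nested hashes. This is only an illustration under stated assumptions, not the code from that class project: the sub names build_trie and splits are invented here, the key '.' plays the role of the "end of word" flag from the diagram, and both the dictionary and the input are assumed to contain letters only. It lists every way a string breaks into dictionary words, so the "theme" vs. "the me" ambiguity shows up as two results rather than being resolved.

```perl
use strict;
use warnings;

# Build a trie as nested hashes: each key is one letter, and the
# special key '.' marks "end of word" (EOW), as in the diagram.
# Assumes words contain letters only, so '.' cannot collide.
sub build_trie {
    my @words = @_;
    my %trie;
    for my $word (@words) {
        my $node = \%trie;
        for my $letter (split //, lc $word) {
            $node->{$letter} //= {};
            $node = $node->{$letter};
        }
        $node->{'.'} = 1;
    }
    return \%trie;
}

# Return every way to split $str (from offset $pos) into dictionary
# words. Each result is an array ref of words; no results means the
# string cannot be split. Assumes $str is lowercase letters only.
sub splits {
    my ($trie, $str, $pos) = @_;
    $pos //= 0;
    return ([]) if $pos == length($str);    # nothing left: one empty split

    my @results;
    my $node = $trie;
    for my $i ($pos .. length($str) - 1) {
        $node = $node->{ substr($str, $i, 1) } or last;    # no such branch
        next unless $node->{'.'};                          # not a word end
        my $word = substr($str, $pos, $i - $pos + 1);
        push @results, [ $word, @$_ ] for splits($trie, $str, $i + 1);
    }
    return @results;
}

my $trie = build_trie(qw(the me theme book keeper bookkeeper));
print join('+', @$_), "\n" for splits($trie, 'theme');
# the+me
# theme
```

    Picking the right split among several candidates is the part the trie does not solve; that still needs word frequencies or some other outside knowledge.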

    Update: You might even get by OK with something like Text::SpellChecker.

Re^2: Splitting compound (concatenated) words
by sauoq (Abbot) on May 15, 2012 at 21:42 UTC
    Improved example code to recognize plurals – at least in English.

    Weird... I added 'child' to your list but 'children' wasn't recognized. Then I tried 'goose' and still no luck. None with "hobby" either, I'm afraid.

    . . .

    Okay, I didn't really try them.

    -sauoq
    "My two cents aren't worth a dime.";