Doing it with a regex could get pretty slow if the dictionary or the subject text gets very long. I worked on a project in a natural language translation class way longer ago than I care to think about, and our approach to the dictionary was to build a linked tree (a trie) where each letter is a node and the letters that can follow it are its child nodes. At the end of each complete word you put a flag node that says "end of word". When a word is also the start of a longer word (or of a true compound), its last letter gets two kinds of children: one for the next letter and one with the "EOW" flag.

Kind of like this (where "." is end of word):

            T
           / \
          H   O
         / \ / \
        E  I .  N
       / \  \    \
      .   N  etc. .
          |
          .

This dictionary includes "To", "Ton", "The", and "Then", and starts to spell out "This". It makes finding the combined words fast and straightforward, but it doesn't help distinguish true compound words (e.g. "bookkeeper") from things like "theme", which could also be read as "the me" (updated here to correct my bad choice of example). If you're really clever you might use some sort of Markov chain tool to guess which reading is right.
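Here's a minimal sketch of that letter tree, written in Python just to show the shape of it (a Perl hash of hashes would work the same way). The names build_trie, splits and END are made up for illustration, and I'm assuming the dictionary words are plain lowercase letters. It lists every way a string can be cut into dictionary words, so it shows the "theme" ambiguity rather than resolving it.

```python
END = "EOW"  # end-of-word flag; multi-character so it can't collide with a letter key

def build_trie(words):
    """Build a nested-dict trie: one node per letter, END marks a complete word."""
    root = {}
    for word in words:
        node = root
        for letter in word.lower():
            node = node.setdefault(letter, {})
        node[END] = True
    return root

def splits(text, trie):
    """Return every way to cut text into a sequence of dictionary words."""
    if not text:
        return [[]]  # one way to split nothing: no words at all
    results = []
    node = trie
    for i, letter in enumerate(text):
        node = node.get(letter)
        if node is None:
            break  # no dictionary word continues with this letter
        if END in node:
            # text[:i + 1] is a complete word; try to split whatever is left
            for rest in splits(text[i + 1:], trie):
                results.append([text[:i + 1]] + rest)
    return results

trie = build_trie(["to", "ton", "the", "then", "me", "theme"])
print(splits("theme", trie))   # [['the', 'me'], ['theme']]
print(splits("tonthe", trie))  # [['ton', 'the']]
```

The plain recursion can blow up on long inputs with lots of overlapping words, so for real text you'd probably want to cache results by position.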

But I don't know the NLP modules well enough to know if there's something kicking around on CPAN. If you have a dictionary to slurp, you could do it yourself fairly easily.
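As a rough idea of what "do it yourself" might look like, here's a sketch (again in Python) that slurps a one-word-per-line list and picks the split with the fewest pieces. To be clear, "fewest pieces" is my own crude stand-in, not a Markov chain or any other statistical model, and load_words, fewest_words_split and the word-list path in the comment are all hypothetical.

```python
from functools import lru_cache

def load_words(lines):
    """Slurp a word list (one word per line) into a set of lowercase words."""
    return {line.strip().lower() for line in lines if line.strip()}

def fewest_words_split(text, words):
    """Split text into dictionary words, preferring the fewest pieces.

    Returns None if text cannot be split into dictionary words at all.
    """
    @lru_cache(maxsize=None)
    def best(start):
        if start == len(text):
            return ()  # nothing left to split
        candidates = []
        for end in range(start + 1, len(text) + 1):
            piece = text[start:end]
            if piece in words:
                rest = best(end)
                if rest is not None:
                    candidates.append((piece,) + rest)
        return min(candidates, key=len) if candidates else None

    result = best(0)
    return None if result is None else list(result)

# Hypothetical usage with a system word list (the path is an assumption):
# with open("/usr/share/dict/words") as fh:
#     words = load_words(fh)
words = load_words(["book\n", "keeper\n", "bookkeeper\n", "the\n", "me\n", "theme\n"])
print(fewest_words_split("bookkeeper", words))  # ['bookkeeper']
print(fewest_words_split("themebook", words))   # ['theme', 'book']
```

Note the heuristic always keeps "theme" whole, so it would get a genuine "the me" wrong; telling those apart is where word frequencies or a Markov-style model would have to come in.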

Update: You might even get by OK with something like Text::SpellChecker.


In reply to Re^2: Splitting compound (concatenated) words by bitingduck
in thread Splitting compound (concatenated) words by vit
