Currently, you have a single-level data structure in a hash that maps a "word" (potentially two or more, space-separated) to a token.

Your code only processes "words" without a space in them.

Hashes support lookup by single entries very well. They don't do well for substring lookups.

I would therefore restructure your token-finding data structure to a multi-level data structure that is organized as a tree of hashes, with the keys being the words:

    my %words2ids = (
        ulimi    => '<ZUL-SIL-0017-n>',
        izinyo   => {
            ''        => '<ZUL-SIL-0018-n>', # no known word follows
            lomhlathi => '<ZUL-SIL-0019-n>',
        },
        ingemuva => {
            lomqala => '<ZUL-SIL-0024-n>',
        },
    );

I use the empty string as the key for the token to output when the word is found but is not followed by any of the words listed under it.
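If you already have the flat hash, you don't need to write the nested structure by hand. Here is a minimal sketch of the conversion. The contents of %flat and the name build_tree are my assumptions, not taken from your code. It also differs slightly from the structure above: every node is a hash and the token always lives under the empty-string key, which keeps the loop short.

```perl
use strict;
use warnings;

# Assumed shape of the existing flat hash: phrase => token
my %flat = (
    'ulimi'            => '<ZUL-SIL-0017-n>',
    'izinyo'           => '<ZUL-SIL-0018-n>',
    'izinyo lomhlathi' => '<ZUL-SIL-0019-n>',
    'ingemuva lomqala' => '<ZUL-SIL-0024-n>',
);

# build_tree is a hypothetical helper: one hash level per word,
# with the token for the phrase stored under the '' key.
sub build_tree {
    my (%phrases) = @_;
    my %tree;
    for my $phrase (keys %phrases) {
        my $node = \%tree;
        # Walk down the tree, creating missing levels (needs Perl 5.10+ for //=)
        $node = ($node->{$_} //= {}) for split ' ', $phrase;
        $node->{''} = $phrases{$phrase};
    }
    return \%tree;
}

my $tree = build_tree(%flat);
```

With this variant, $tree->{izinyo}{''} holds the token for "izinyo" alone and $tree->{izinyo}{lomhlathi}{''} holds the token for the two-word unit.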

Then, when looking up words in the structure, if you find a token (a plain string), you output it. If you find a nested hash instead, you descend into it and look up the next word there. If the next word is not found at that level, output the token associated with the empty string at the level you reached, and continue with that word from the top of the tree.
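The lookup described above could be sketched roughly like this. The name tokenize, the greedy longest-match behaviour, and passing unknown words through unchanged are my assumptions; "foo" in the sample input is just a made-up word that is not in the tree.

```perl
use strict;
use warnings;

my %words2ids = (
    ulimi    => '<ZUL-SIL-0017-n>',
    izinyo   => {
        ''        => '<ZUL-SIL-0018-n>', # no known word follows
        lomhlathi => '<ZUL-SIL-0019-n>',
    },
    ingemuva => {
        lomqala => '<ZUL-SIL-0024-n>',
    },
);

# tokenize is a hypothetical helper: replaces the longest known
# unit starting at each position by its token.
sub tokenize {
    my ($tree, @words) = @_;
    my @out;
    my $i = 0;
    while ($i < @words) {
        my $node = $tree;
        my ($best, $best_len);    # longest match found so far
        my $j = $i;
        # Descend one level per word for as long as the words are known
        while (ref($node) eq 'HASH' && $j < @words
               && exists $node->{ $words[$j] }) {
            $node = $node->{ $words[$j] };
            $j++;
            # A string is a token; a hash may carry one under ''
            my $token = ref($node) eq 'HASH' ? $node->{''} : $node;
            ($best, $best_len) = ($token, $j - $i) if defined $token;
        }
        if (defined $best) {
            push @out, $best;
            $i += $best_len;
        }
        else {
            push @out, $words[$i];    # unknown word: keep it as is
            $i++;
        }
    }
    return @out;
}

my @result = tokenize(\%words2ids,
    split ' ', 'ulimi izinyo lomhlathi izinyo ingemuva lomqala ingemuva foo');
print "@result\n";
```

This prints `<ZUL-SIL-0017-n> <ZUL-SIL-0019-n> <ZUL-SIL-0018-n> <ZUL-SIL-0024-n> ingemuva foo`. Note that the lone "ingemuva" stays untouched because that branch has no empty-string entry.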

Consider playing through the approach with pen and paper first to get an understanding of the data structure.


In reply to Re: Finding multiword units in a corpus by Corion
in thread Finding multiword units in a corpus by veg_running
