in reply to Finding multiword units in a corpus
Currently, you have a single-level data structure in a hash that maps a "word" (potentially two or more, space-separated) to a token.
Your code only processes "words" without a space in them.
Hashes support lookup by single entries very well. They don't do well for substring lookups.
I would therefore restructure your token-finding data structure to a multi-level data structure that is organized as a tree of hashes, with the keys being the words:
    my %words2ids = (
        ulimi    => '<ZUL-SIL-0017-n>',
        izinyo   => {
            ''        => '<ZUL-SIL-0018-n>',   # no known word follows
            lomhlathi => '<ZUL-SIL-0019-n>',
        },
        ingemuva => {
            lomqala   => '<ZUL-SIL-0024-n>',
        },
    );
I use the empty string as the key for the token of the word on its own, that is, when the word is found but is not followed by any of the words associated with it.
Then, when looking up words in the structure, if you find a token (a plain string), you output it. If you find a hash instead, you descend further into that tree with the next word. Otherwise, if you don't find the next word at all, output the token associated with the empty string at the previous position and continue with that word from the top of the tree.
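As a rough illustration of that lookup, here is a minimal sketch. The helper name `tag_words` is my own invention, and so is the choice to pass unknown words through unchanged; neither comes from your code. It also assumes the input is already split on whitespace. It walks the tree word by word and remembers the longest match seen so far, which is where the empty-string key comes in.

```perl
use strict;
use warnings;

# Same tree of hashes as above, repeated so the sketch runs on its own.
my %words2ids = (
    ulimi    => '<ZUL-SIL-0017-n>',
    izinyo   => {
        ''        => '<ZUL-SIL-0018-n>',   # "izinyo" on its own
        lomhlathi => '<ZUL-SIL-0019-n>',   # "izinyo lomhlathi"
    },
    ingemuva => {
        lomqala   => '<ZUL-SIL-0024-n>',   # only "ingemuva lomqala" is known
    },
);

# tag_words is a hypothetical helper, not part of the original code.
# It replaces the longest known word sequence at each position with its
# token; words without any match are passed through unchanged (assumption).
sub tag_words {
    my ($tree, @words) = @_;
    my @out;
    my $i = 0;
    while ($i < @words) {
        my $node = $tree;
        my ($best, $best_len);    # longest match starting at position $i
        my $j = $i;
        # Descend as long as we are in a hash and the next word is a key.
        while (ref($node) eq 'HASH' && $j < @words
               && exists $node->{ $words[$j] }) {
            $node = $node->{ $words[$j] };
            $j++;
            if (!ref $node) {
                # Leaf: a plain token string.
                ($best, $best_len) = ($node, $j - $i);
            }
            elsif (exists $node->{''}) {
                # Inner node that is also a complete unit by itself.
                ($best, $best_len) = ($node->{''}, $j - $i);
            }
        }
        if (defined $best) {
            push @out, $best;
            $i += $best_len;
        }
        else {
            push @out, $words[$i];
            $i++;
        }
    }
    return @out;
}

my @tagged = tag_words(\%words2ids,
    split ' ', 'ulimi izinyo lomhlathi izinyo ingemuva lomqala');
print join(' ', @tagged), "\n";
```

Remembering the longest match rather than outputting the first token found means "izinyo lomhlathi" wins over plain "izinyo" whenever both would fit, and the same loop works unchanged if you later nest the hashes three or more levels deep.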
Consider playing through the approach with pen and paper first to get an understanding of the data structure.