Instead of %words2ids having key-value pairs like
word => 'id'
they could perhaps be more like
word => [qw{id1 id2}]
I'm not really across the specifics of what you want. You should supply a sample tagset file, a sample input file, and the expected output using those two files.
My best guess would be changing these lines:
...
$words2ids{fc $text} = $token;
...
s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg;
...
to
...
push @{$words2ids{fc $text}}, $token;
...
s/$re/++$found{fc $1}, "$1 @{$words2ids{fc $1}}"/eg;
...
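Here is a minimal, self-contained sketch of that idea. The tagset data, the sample line, and the way $re is built are all made up for illustration, as I haven't seen your real files. Only the push and s///eg lines correspond to the change suggested above.

```perl
use strict;
use warnings;
use feature 'fc';    # fc (case folding) needs Perl 5.16 or later

# Hypothetical tagset: the same text (after case folding) can carry several IDs.
my @tagset = (
    [ 'New York' => 'id1' ],
    [ 'new york' => 'id2' ],
    [ 'hot dog'  => 'id3' ],
);

my (%words2ids, %found);

# Collect every ID for a given case-folded text instead of overwriting it.
for my $pair (@tagset) {
    my ($text, $token) = @$pair;
    push @{ $words2ids{fc $text} }, $token;
}

# Assumed construction of $re: an alternation of the known texts, longest first.
my $alt = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) }
    keys %words2ids;
my $re = qr/\b($alt)\b/i;

# Hypothetical input line.
my $line = 'I ate a hot dog in New York.';

# Count each match and append all of its IDs (space-separated) after it.
$line =~ s/$re/++$found{fc $1}, "$1 @{$words2ids{fc $1}}"/eg;

print "$line\n";    # prints: I ate a hot dog id3 in New York id1 id2.
```

Interpolating @{...} in a double-quoted string joins the elements with the value of $", which is a single space by default. If you want a different separator between IDs, use join explicitly instead.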
You should try this for yourself. If you run into difficulties, put together an SSCCE and post it: along with the two sample files and the expected output, this will give us a much better chance of quickly resolving whatever problems you're encountering.
— Ken
In reply to Re^3: Finding multiword units in a corpus by kcott, in thread Finding multiword units in a corpus by veg_running