Instead of %words2ids having key-value pairs like
word => 'id'
they could perhaps be more like
word => [qw{id1 id2}]
I'm not entirely clear on the specifics of what you want. You should supply a sample tagset file, a sample input file, and the expected output using those two files.
My best guess would be changing these lines:
    ...
    $words2ids{fc $text} = $token;
    ...
    s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg;
    ...
to
    ...
    push @{$words2ids{fc $text}}, $token;
    ...
    s/$re/++$found{fc $1}, "$1 @{$words2ids{fc $1}}"/eg;
    ...
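To illustrate the idea, here is a minimal, self-contained sketch of the hash-of-arrays version. The tagset entries, the sample sentence, and the way $re is built are all assumptions made up for the example (I haven't seen your actual files); only the push and s///e lines correspond to the changes suggested above.

```perl
use strict;
use warnings;
use feature 'fc';    # fc() needs Perl 5.16 or later

# Hypothetical tagset data: the same unit appears under two IDs.
my @tagset = (
    [ 'New York'  => 'id1' ],
    [ 'new york'  => 'id2' ],
    [ 'Hong Kong' => 'id3' ],
);

my (%words2ids, %found);

# Collect every ID for a unit instead of overwriting the previous one.
for my $entry (@tagset) {
    my ($text, $token) = @$entry;
    push @{ $words2ids{fc $text} }, $token;
}

# Assumed way of building the pattern: longest alternatives first,
# so that longer units are tried before shorter ones.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length $b <=> length $a || $a cmp $b }
    keys %words2ids;
my $re = qr/\b($alternation)\b/i;

# Hypothetical input line.
my $line = 'She flew from New York to Hong Kong.';

# Count each match and append all of its IDs, space-separated.
(my $tagged = $line) =~
    s/$re/++$found{fc $1}, "$1 @{$words2ids{fc $1}}"/eg;

print "$tagged\n";
# prints: She flew from New York id1 id2 to Hong Kong id3.
```

The IDs come out separated by single spaces because interpolating an array in a double-quoted string joins its elements with $" (a space by default). If you need a different separator, an explicit join would probably be clearer.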
You should try this for yourself. If you run into difficulties, put together an SSCCE and post it: along with the two sample files and the expected output, this will give us a much better chance of quickly resolving whatever problems you're encountering.
— Ken