Instead of %words2ids having key-value pairs like

word => 'id'

they could perhaps be more like

word => [qw{id1 id2}]

I'm not clear on the specifics of what you want. Please supply a sample tagset file, a sample input file, and the expected output for those two files.

My best guess would be changing these lines:

...
$words2ids{fc $text} = $token;
...
s/$re/++$found{fc $1}, "$1 $words2ids{fc $1}"/eg;
...

to

...
push @{$words2ids{fc $text}}, $token;
...
s/$re/++$found{fc $1}, "$1 @{$words2ids{fc $1}}"/eg;
...
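
To see the hash-of-arrays version in isolation, here is a minimal, self-contained sketch. I don't have your real files, so the tagset entries, the input line, and the way $re is built are all made up for illustration; only the push and s///eg lines mirror the change suggested above.

```perl
use strict;
use warnings;
use feature 'fc';    # fc needs Perl 5.16 or later

# Hypothetical tagset: (text, token) pairs, with 'New York' listed twice.
my @tagset = (
    ['New York', 'LOC'],
    ['New York', 'ORG'],
    ['hot dog',  'FOOD'],
);

my (%words2ids, %found);

# Each casefolded key now holds an arrayref of every ID seen for that text.
push @{ $words2ids{fc $_->[0]} }, $_->[1] for @tagset;

# Assumed regex construction: longest alternatives first, metacharacters escaped.
my $alt = join '|',
    map  { quotemeta $_ }
    sort { length($b) <=> length($a) }
    keys %words2ids;
my $re = qr/\b($alt)\b/i;

# Hypothetical input line.
my $line = 'I ate a hot dog in new york.';
(my $tagged = $line) =~ s/$re/++$found{fc $1}, "$1 @{$words2ids{fc $1}}"/eg;

print "$tagged\n";
# prints: I ate a hot dog FOOD in new york LOC ORG.
```

The IDs are joined with a single space because array interpolation in a double-quoted string uses $" (the list separator), which defaults to a space. If you want a different separator, use join explicitly inside the replacement.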

You should try this for yourself. If you run into difficulties, put together an SSCCE (Short, Self-Contained, Correct Example) and post it. Together with the two sample files and the expected output, this will give us a much better chance of quickly resolving whatever problems you encounter.

— Ken