in reply to Re: How to tokenize string by custom dictionary? (+code)
in thread How to tokenize string by custom dictionary?
Thank you, Rolf, kind monk.
I have a few questions:
1. The names contain regex metacharacters ('.', '$' and the like), for example: ke$ha, d.b.cooper, Tim Turner (I)...
2. There are a lot of names, more than 214132, so this approach seems too slow to me.
3. Using only @matches = ( $input =~ /($regex)/g ), we cannot resolve ambiguous names: Alex Fong / Fong, 周杰 / 周杰伦, 信 / 方中信...
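For points 1 and 3, here is a minimal sketch of one common approach, not necessarily what the earlier reply intended: escape each name with quotemeta and sort the names longest first, so that a longer name is tried before any shorter name it contains. The @names list and the find_names helper are hypothetical; a real dictionary would be loaded from a file.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical sample dictionary (assumption, for illustration only).
my @names = ('ke$ha', 'd.b.cooper', 'Tim Turner (I)',
             'Alex Fong', 'Fong', '周杰', '周杰伦');

# Longest names first, so 'Alex Fong' is tried before 'Fong'.
# quotemeta escapes '.', '$', '(' and similar so they match literally.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @names;
my $regex = qr/$alternation/;

# Return every dictionary name found in the input, left to right.
sub find_names {
    my ($input) = @_;
    return $input =~ /($regex)/g;
}

my @matches = find_names('ke$ha met Alex Fong and 周杰伦');
```

On point 2, Perl 5.10 and later compile an alternation of literal strings into a trie, which may help with a large list, though whether that is fast enough for 214132 names would need to be measured.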
PS:
For pure Chinese strings, I use Lingua::ZH::WordSegment with a custom dictionary, and it works OK.
But for other languages, or text mixed with Chinese, I cannot find a way to ensure that the tokens/chunks listed in the dictionary are NOT split.
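One way to keep dictionary entries intact in mixed-language text is forward maximum matching against a hash. The sketch below swaps in that technique under stated assumptions: %dict and tokenize are hypothetical names, and it simply takes the longest entry starting at each position, which is only one possible rule for cases like 信 / 方中信.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical sample dictionary (assumption, for illustration only).
my %dict = map { $_ => 1 } ('Alex Fong', 'Fong', '信', '方中信', 'ke$ha');

# The longest entry bounds how far each lookup has to look ahead.
my $max_len = 0;
for my $name (keys %dict) {
    $max_len = length($name) if length($name) > $max_len;
}

# At each position take the longest dictionary entry starting there;
# otherwise collect the character into a filler (unknown) token.
sub tokenize {
    my ($input) = @_;
    my @tokens;
    my $pos = 0;
    my $n   = length($input);
    POSITION: while ($pos < $n) {
        my $left  = $n - $pos;
        my $limit = $left < $max_len ? $left : $max_len;
        for (my $len = $limit; $len >= 1; $len--) {
            my $candidate = substr($input, $pos, $len);
            if ($dict{$candidate}) {
                push @tokens, { text => $candidate, known => 1 };
                $pos += $len;
                next POSITION;
            }
        }
        # No entry starts here: extend the last filler token or start one.
        my $char = substr($input, $pos, 1);
        if (@tokens && !$tokens[-1]{known}) {
            $tokens[-1]{text} .= $char;
        } else {
            push @tokens, { text => $char, known => 0 };
        }
        $pos++;
    }
    return @tokens;
}

my @tokens = tokenize('方中信 and Alex Fong');
```

Each position costs at most $max_len hash lookups, so the running time depends on the input length and the longest name rather than on the number of names. The sketch does not check word boundaries, so a short entry can still match inside a longer unrelated word.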