in reply to How to tokenize string by custom dictionary?

If I understand your question and your use of the word "tokenize" correctly, you just need a regex where all names are ORed.

Perl's regex engine already trie-optimizes long literal alternations.

updated

Now that I'm back at a desktop computer, let's try...


Here it is, pre-formatted to display the Unicode characters...

  DB<105> %names =(
           '纯ちゃん' => 2,
           '周杰倫' => 57,
         'Alex Fong' => 100,
           )

DB<106> $input=q{"Esther Kwan, 纯ちゃん | Alex Fong (Hong Kong) / Joe Smith ; Fong 周杰倫 Ferenc Kállai"}

DB<107> $regex = join '|', keys %names

DB<108> @matches = ( $input =~ /($regex)/g )

DB<110> print join ",", @matches
纯ちゃん,Alex Fong,周杰倫
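The debugger session above can also be written as a standalone script (a sketch; the names and input string are taken from the example, the rest is boilerplate):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my %names = (
    '纯ちゃん'  => 2,
    '周杰倫'    => 57,
    'Alex Fong' => 100,
);

my $input = q{"Esther Kwan, 纯ちゃん | Alex Fong (Hong Kong) / Joe Smith ; Fong 周杰倫 Ferenc Kállai"};

# OR all dictionary keys into one alternation; for long lists of
# literal alternatives Perl builds a trie internally.
my $regex   = join '|', keys %names;
my @matches = $input =~ /($regex)/g;

print join(',', @matches), "\n";   # 纯ちゃん,Alex Fong,周杰倫
```

The matches come out in input order, regardless of hash-key order, because /g scans the string left to right.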

update

Reading your task description again, I doubt that your teacher will accept this approach as homework! =)

Cheers Rolf

( addicted to the Perl Programming Language)

Re^2: How to tokenize string by custom dictionary? (+code)
by infantcoder (Novice) on Nov 06, 2013 at 03:06 UTC

    Thank you, Rolf, such a kind monk.

    I have questions:

    1. Some names contain regex metacharacters ('.', '$' and the like), for example: ke$ha, d.b.cooper, Tim Turner (I)...

    2. There is a large number of names, more than 214,132, so I expect matching to be too slow.

    3. Using only @matches = ( $input =~ /($regex)/g ), we cannot distinguish ambiguous names: Alex Fong / Fong, 周杰/周杰伦, 信/方中信...

    PS:

    For pure Chinese strings I use Lingua::ZH::WordSegment with a custom dictionary, and it works OK.

    But for other languages, or text mixed with Chinese, I cannot find a way to ensure that tokens/chunks from the dictionary are NOT split.

      1. you have to escape them with quotemeta() or \Q...\E, respectively

      2. you can either increase the buffer size for trie optimization (see ${^RE_TRIE_MAXBUF} in perlvar) or always test in chunks of names (10000 at a time, say). I recommend a combination of both.

      3. > we can not distinguish the ambiguous name

        sorry I don't understand.
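      Points 1 and 3 might be combined in a sketch like the following. The quotemeta escaping is from the answer above; the longest-first sort is my suggestion for the ambiguity question (a prefix like 周杰 must come after 周杰倫 in the alternation, or it wins at the same position):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

my %names = (
    'ke$ha'  => 1,   # contains the regex metacharacter '$'
    '周杰'   => 2,   # prefix of the longer name below
    '周杰倫' => 3,
);

# quotemeta escapes '.', '$', '(' etc.; sorting longest-first makes
# the alternation prefer '周杰倫' over '周杰' at the same position.
# For 200k+ names, see ${^RE_TRIE_MAXBUF} or match chunk-wise.
my $regex = join '|',
            map  { quotemeta }
            sort { length($b) <=> length($a) }
            keys %names;

my $input   = 'ke$ha sang with 周杰倫';
my @matches = $input =~ /($regex)/g;

print join(',', @matches), "\n";   # ke$ha,周杰倫
```

      Without the sort, whichever key happens to come first in the alternation would win, so 周杰 could clip 周杰倫 to 周杰.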

      Cheers Rolf

      ( addicted to the Perl Programming Language)

        Thank you very much, Rolf. It works.

        Come to Beijing, just call me, beers waiting for you. ^_^