How to tokenize string by custom dictionary?

infantcoder has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Monks. I want to extract people's names from a string using a name dictionary.

My plan is as follows:

1. Build a trie (using Tree::Trie) to store the names from the name dictionary.

2. Tokenize the string using the name dictionary (which includes frequencies).

3. Check whether each token exists in the trie.

Name dictionary (name => frequency pairs):

Alex Fong => 100

Fong => 100

Ferenc Kállai => 96

Joe Smith => 95

Sándor Pécsi => 90

John Doe => 89

Sándor Tompa => 62

周杰倫 => 57

纯ちゃん => 2

... ...

Example1:

Input string: "Esther Kwan, 纯ちゃん | Alex Fong (Hong Kong) / Joe Smith ; Fong 周杰倫 Ferenc Kállai"

Output (order not important):

"Alex Fong"

"周杰倫"

"纯ちゃん"

"Joe Smith"

"Fong"

"Ferenc Kállai"

Example2:

Input string: "You know Alex Fong believe what Fong said ?"

Output (order not important):

"Alex Fong"

"Fong"

Question:

Step 2: how do I tokenize a string using a custom dictionary?

That is, tokens listed in the dictionary must NOT be split.

Are there any Perl modules or toolkits available for this?

Thank you, Monks.


Replies are listed 'Best First'.
Re: How to tokenize string by custom dictionary? (+code) by LanX (Saint) on Nov 05, 2013 at 15:14 UTC
If I understand your question and your use of the word "tokenize" correctly, you just need a regex in which all the names are ORed together. The regex engine already applies a trie optimization to such alternations.

Updated: now that I'm back at a desktop computer, let's try...

`DB<105> %names = ( '纯ちゃん' => 2, '周杰倫' => 57, 'Alex Fong' => 100, )`
`DB<106> $input = q{"Esther Kwan, 纯ちゃん | Alex Fong (Hong Kong) / Joe Smith ; Fong 周杰倫 Ferenc Kállai"}`
`DB<107> $regex = join '|', keys %names`
`DB<108> @matches = ( $input =~ /($regex)/g )`
`DB<110> print join ",", @matches`
`纯ちゃん,Alex Fong,周杰倫`

Update: reading your task description again, I doubt that your teacher will accept this approach as homework! =)

Cheers Rolf

( addicted to the Perl Programming Language)
Re^2: How to tokenize string by custom dictionary? (+code) by infantcoder (Novice) on Nov 06, 2013 at 03:06 UTC
Thank you, Rolf, you are a very kind monk. I have some questions:

1. Some names contain metacharacters ('.', '$' and the like), for example: ke$ha, d.b.cooper, Tim Turner (I)...

2. There are a large number of names, more than 214132, so I think this will be too slow.

3. Using only @matches = ( $input =~ /($regex)/g ), we cannot distinguish ambiguous names: Alex Fong / Fong, 周杰/周杰伦, 信/方中信...

PS: For pure Chinese strings, I use Lingua::ZH::WordSegment with a custom dictionary, and it works OK. But for other languages, or text mixed with Chinese, I cannot find a way to ensure that tokens/chunks listed in the dictionary are NOT split.
Re^3: How to tokenize string by custom dictionary? (+code) by LanX (Saint) on Nov 06, 2013 at 18:02 UTC
1. You have to escape them: use quotemeta or \Q respectively.

2. You can either increase the buffer size for the trie optimization (see `${^RE_TRIE_MAXBUF}` in perlvar) or always test in chunks of names (like 10000?). I recommend a combination of both.

3. > we can not distinguish the ambiguous name

Sorry, I don't understand.

Cheers Rolf

( addicted to the Perl Programming Language)
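A minimal sketch of the escaping advice above, not a tested solution for the full 214132-name dictionary. The helper name `find_names` is hypothetical, and the `${^RE_TRIE_MAXBUF}` value is an arbitrary guess that would need tuning. Sorting the names longest-first is an added assumption, not something suggested in the thread: Perl tries alternatives in order, so this lets "Alex Fong" win over "Fong" wherever both could match at the same position.

```perl
use strict;
use warnings;
use utf8;                              # the source contains UTF-8 name literals
binmode STDOUT, ':encoding(UTF-8)';

# Raise the trie optimization buffer before the regex is compiled.
# The value is an arbitrary assumption; see perlvar for the default.
${^RE_TRIE_MAXBUF} = 2**24;

# Hypothetical helper: return every dictionary name found in $input,
# in the order the names appear in the input.
sub find_names {
    my ($input, @names) = @_;
    return () unless @names;           # an empty alternation would match everywhere

    # Longest names first (assumption), each escaped with quotemeta so
    # that '.', '$', '(' and similar characters match literally.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @names;

    my $regex = qr/($alternation)/;
    my @found = $input =~ /$regex/g;
    return @found;
}

# Example 2 from the question:
my @found = find_names(
    'You know Alex Fong believe what Fong said ?',
    'Fong', 'Alex Fong', 'ke$ha', 'd.b.cooper',
);
print join(',', @found), "\n";         # prints "Alex Fong,Fong"
```

This sketch does not anchor matches at word boundaries, so "Fong" would also be found inside a longer word such as "Fongs". If the names are tested in separate chunks, a short name in one chunk can still match inside a longer name that belongs to another chunk.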
Re^4: How to tokenize string by custom dictionary? (+code) by infantcoder (Novice) on Nov 07, 2013 at 03:30 UTC
Re^5: How to tokenize string by custom dictionary? (+code) by LanX (Saint) on Nov 08, 2013 at 19:48 UTC
