infantcoder has asked for the wisdom of the Perl Monks concerning the following question:

Hi, Monks. I want to extract people's names from a string using a name dictionary.

My plan is as follows:

1. Build a trie (using Tree::Trie) to store the names from the dictionary.

2. Tokenize the string according to the name dictionary (which includes frequencies).

3. Check whether each token exists in the trie.

Name dictionary (name => frequency pairs):

Alex Fong => 100

Fong => 100

Ferenc Kállai => 96

Joe Smith => 95

Sándor Pécsi => 90

John Doe => 89

Sándor Tompa => 62

周杰倫 => 57

纯ちゃん => 2

... ...

Example1:

Input string: "Esther Kwan, 纯ちゃん | Alex Fong (Hong Kong) / Joe Smith ; Fong 周杰倫 Ferenc Kállai"

Output (order not important):

"Alex Fong"

"周杰倫"

"纯ちゃん"

"Joe Smith"

"Fong"

"Ferenc Kállai"

Example2:

Input string: "You know Alex Fong believe what Fong said ?"

Output (order not important):

"Alex Fong"

"Fong"

Question:

For step 2: how do I tokenize a string using a custom dictionary?

That is, entries listed in the dictionary must NOT be split.

Are there any Perl modules or toolkits available for this?

Thank you, Monks.

Replies are listed 'Best First'.
Re: How to tokenize string by custom dictionary? (+code)
by LanX (Saint) on Nov 05, 2013 at 15:14 UTC
    If I understand your question and your use of the word "tokenize" correctly, you just need a regex where all names are ORed.

    The regex engine is already trie optimized.

    updated

    now that I'm back to a desktop computer let's try...


    here preformatted to display the Unicode characters...

    DB<105> %names = (
               '纯ちゃん'  => 2,
               '周杰倫'    => 57,
               'Alex Fong' => 100,
            )

    DB<106> $input = q{"Esther Kwan, 纯ちゃん | Alex Fong (Hong Kong) / Joe Smith ; Fong 周杰倫 Ferenc Kállai"}

    DB<107> $regex = join '|', keys %names

    DB<108> @matches = ( $input =~ /($regex)/g )

    DB<110> print join ",", @matches
    纯ちゃん,Alex Fong,周杰倫
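
For anyone who prefers a standalone script to a debugger session, here is a minimal sketch of the same ORed-names regex. It assumes the file is saved as UTF-8, uses only a subset of the dictionary from the question, and assumes none of those names contain regex metacharacters. The helper name find_names is made up for illustration.

```perl
use strict;
use warnings;
use utf8;                               # the string literals below are UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# name => frequency, a subset of the dictionary from the question
my %names = (
    'Alex Fong'     => 100,
    'Fong'          => 100,
    'Ferenc Kállai' => 96,
    'Joe Smith'     => 95,
    '周杰倫'        => 57,
    '纯ちゃん'      => 2,
);

# One alternation over all names, compiled once.
my $alternation = join '|', keys %names;
my $regex       = qr/$alternation/;

# Hypothetical helper: returns every dictionary name found in $input,
# in the order the names occur in the string.
sub find_names {
    my ($input) = @_;
    return $input =~ /($regex)/g;
}

my @example1 = find_names(
    'Esther Kwan, 纯ちゃん | Alex Fong (Hong Kong) / Joe Smith ; Fong 周杰倫 Ferenc Kállai'
);
my @example2 = find_names('You know Alex Fong believe what Fong said ?');

print join(',', @example2), "\n";       # prints: Alex Fong,Fong
```

The regex scans left to right, so in the second example "Alex Fong" is consumed as a whole before the later standalone "Fong" is found.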

    update

    reading your task description again I doubt that your teacher will accept this approach as homework! =)

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Thank you, Rolf, kind monk.

      I have some questions:

      1. Some names contain metacharacters ('.', '$', and the like), for example: ke$ha, d.b.cooper, Tim Turner (I)...

      2. There are a lot of names, more than 214132, so I find this too slow.

      3. Using only @matches = ( $input =~ /($regex)/g ), we cannot distinguish ambiguous names: Alex Fong / Fong, 周杰/周杰伦, 信/方中信...

      PS:

      For pure Chinese strings, I use Lingua::ZH::WordSegment with a custom dictionary, and it works OK.

      But for other languages, or text mixed with Chinese, I cannot find a way to ensure that the tokens/chunks listed in the dictionary are NOT split.

        1. you have to escape them: use quotemeta or \Q

        2. you can either increase buffer-size for trie optimization (see ${^RE_TRIE_MAXBUF} in perlvar) or always test in chunks of names (like 10000?). I recommend a combination of both.

        3. > we can not distinguish the ambiguous name

          sorry I don't understand.
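
Points 1 and 2 could be combined roughly like this. It is only a sketch: the helper name build_regexes, the chunk size, and the buffer value are assumptions chosen for illustration, and the right numbers for 200,000+ names would have to be found by benchmarking.

```perl
use strict;
use warnings;

# Assumed value; perlvar gives 65536 as the default.
# Set it before the patterns are compiled.
${^RE_TRIE_MAXBUF} = 2**24;

# Hypothetical helper: split the name list into chunks and compile
# one alternation per chunk, escaping metacharacters with quotemeta.
sub build_regexes {
    my ($names, $chunk_size) = @_;
    my @pending = @$names;              # copy, so the caller's list is untouched
    my @regexes;
    while (my @chunk = splice @pending, 0, $chunk_size) {
        my $alternation = join '|', map { quotemeta } @chunk;
        push @regexes, qr/$alternation/;
    }
    return @regexes;
}

my @names   = ('ke$ha', 'd.b.cooper', 'Tim Turner (I)');
my @regexes = build_regexes(\@names, 2);    # tiny chunk size, for demonstration only

my $input = 'ke$ha met d.b.cooper, not dxbxcooper';
my @found;
for my $regex (@regexes) {
    push @found, $input =~ /($regex)/g;
}

print join(',', @found), "\n";          # prints: ke$ha,d.b.cooper
```

Because the dots are escaped, "dxbxcooper" is not reported. Note that each chunk is matched on its own, so the results are grouped by chunk rather than by position in the input.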

        Cheers Rolf

        ( addicted to the Perl Programming Language)