Hi, Monks. I want to extract people's names from a string using a name dictionary.

My plan is as follows:

1. Build a trie (using Tree::Trie) to store the names from the name dictionary.

2. Tokenize the string using the name dictionary (which includes frequencies).

3. Check whether each token exists in the trie.
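
The three steps above can also be folded into one pass: walk the string and, at each position, follow the trie as far as it goes, keeping the longest complete name. Below is a minimal sketch of that idea. It uses a hand-rolled nested-hash trie rather than Tree::Trie, the sub names (build_trie, longest_match_at, extract_names) are made up for illustration, and it assumes the script is saved as UTF-8. It ignores the frequencies and does no word-boundary check, so "Fong" would also be found inside a longer word.

```perl
use strict;
use warnings;
use utf8;    # the source file itself contains UTF-8 literals

# Build a character-level trie as nested hashes.
# The multi-character key 'end' marks a complete name; it cannot
# collide with the single-character keys used for the edges.
sub build_trie {
    my (@names) = @_;
    my %root;
    for my $name (@names) {
        my $node = \%root;
        for my $char (split //, $name) {
            $node = $node->{$char} //= {};
        }
        $node->{end} = 1;
    }
    return \%root;
}

# Length of the longest dictionary name starting at $start, or 0.
sub longest_match_at {
    my ($trie, $chars, $start) = @_;
    my ($node, $best) = ($trie, 0);
    for my $i ($start .. $#$chars) {
        $node = $node->{ $chars->[$i] } or last;
        $best = $i - $start + 1 if $node->{end};
    }
    return $best;
}

# Scan left to right, taking the longest name at each position.
sub extract_names {
    my ($trie, $text) = @_;
    my @chars = split //, $text;
    my @found;
    my $pos = 0;
    while ($pos < @chars) {
        my $len = longest_match_at($trie, \@chars, $pos);
        if ($len) {
            push @found, join '', @chars[ $pos .. $pos + $len - 1 ];
            $pos += $len;
        }
        else {
            $pos++;
        }
    }
    return @found;
}

# Names taken from the dictionary in this post (frequencies omitted).
my $trie = build_trie(
    'Alex Fong', 'Fong', 'Ferenc Kállai', 'Joe Smith', 'Sándor Pécsi',
    'John Doe', 'Sándor Tompa', '周杰倫', '纯ちゃん',
);

my @example1 = extract_names($trie,
    'Esther Kwan, 纯ちゃん | Alex Fong (Hong Kong) / Joe Smith ; Fong 周杰倫 Ferenc Kállai');

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @example1;
```

Because the longest name wins at each position, "Alex Fong" is kept whole and a later standalone "Fong" is still reported separately.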

Name dictionary (name => frequency pairs):

Alex Fong => 100

Fong => 100

Ferenc Kállai => 96

Joe Smith => 95

Sándor Pécsi => 90

John Doe => 89

Sándor Tompa => 62

周杰倫 => 57

纯ちゃん => 2

... ...

Example 1:

Input string: "Esther Kwan, 纯ちゃん | Alex Fong (Hong Kong) / Joe Smith ; Fong 周杰倫 Ferenc Kállai"

Output (order is not important):

"Alex Fong"

"周杰倫"

"纯ちゃん"

"Joe Smith"

"Fong"

"Ferenc Kállai"

Example 2:

Input string: "You know Alex Fong believe what Fong said ?"

Output (order is not important):

"Alex Fong"

"Fong"

Question:

For step 2: how do I tokenize a string using a custom dictionary?

That is, tokens listed in the dictionary must NOT be split.

Are there any Perl modules or toolkits available for this?
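
To make the "must not be split" requirement concrete, here is a rough sketch using only core Perl: the dictionary keys are joined into one regex alternation, longest first, so a multi-word entry is always tried before any shorter entry it contains. The name find_names and the Latin-letter lookarounds are my own assumptions for illustration (they stop "Fong" from matching inside a longer Latin word while still allowing CJK names with no spaces around them). The frequencies are carried in the hash but not used, and the script is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # the source file itself contains UTF-8 literals

# Dictionary copied from this post (name => frequency).
my %freq = (
    'Alex Fong'     => 100,
    'Fong'          => 100,
    'Ferenc Kállai' => 96,
    'Joe Smith'     => 95,
    'Sándor Pécsi'  => 90,
    'John Doe'      => 89,
    'Sándor Tompa'  => 62,
    '周杰倫'        => 57,
    '纯ちゃん'      => 2,
);

# Longest names first: Perl tries alternatives left to right,
# so "Alex Fong" is attempted before "Fong".
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) or $a cmp $b } keys %freq;

# A name must not be glued to a Latin letter on either side.
my $name_re = qr/(?<!\p{Latin})(?:$alternation)(?!\p{Latin})/;

# Return every dictionary name found in $text, in order of appearance.
sub find_names {
    my ($text) = @_;
    my @names = $text =~ /($name_re)/g;
    return @names;
}

my @names_found = find_names(
    'Esther Kwan, 纯ちゃん | Alex Fong (Hong Kong) / Joe Smith ; Fong 周杰倫 Ferenc Kállai');

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @names_found;

# Example 2 from the post yields ("Alex Fong", "Fong").
my @example2 = find_names('You know Alex Fong believe what Fong said ?');
```

This skips explicit tokenization entirely and goes straight to the matched names; for a very large dictionary, an explicit trie walk may be the better fit.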

Thank you, Monks.

In reply to How to tokenize string by custom dictionary? by infantcoder
