in reply to Re: How to tokenize string by custom dictionary? (+code)
in thread How to tokenize string by custom dictionary?

Thank you, @Cheers Rolf, such a kind monk.

I have questions:

1. The names contain metacharacters ('.', '$', and so on), for example: ke$ha, d.b.cooper, Tim Turner (I)...

2. There are a large number of names, more than 214,132, so this is too slow in my experience.

3. Using only @matches = ( $input =~ /($regex)/g ), we cannot distinguish ambiguous names: Alex Fong / Fong, 周杰 / 周杰伦, 信 / 方中信... (a small sketch follows below).
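
A minimal sketch of the ambiguity, with made-up @names and $input (the real dictionary is far larger):

    use strict;
    use warnings;
    use utf8;
    binmode STDOUT, ':encoding(UTF-8)';

    # Made-up names and input, only to illustrate the problem.
    my @names = ('周杰', '周杰伦', 'Alex Fong', 'Fong');
    my $input = '周杰伦 and Alex Fong';

    # Alternation tries the alternatives left to right, so with 周杰 listed
    # before 周杰伦 the longer name can never match here.
    my $regex   = join '|', map { quotemeta } @names;
    my @matches = ( $input =~ /($regex)/g );
    print "$_\n" for @matches;    # prints 周杰 and Alex Fong, not 周杰伦

Sorting @names longest first before the join helps the 周杰/周杰伦 case, but a bare 信 or Fong in the text is still ambiguous.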

PS:

For pure Chinese strings, I use Lingua::ZH::WordSegment with a custom dictionary, and it works OK.

But for other languages, or text mixed with Chinese, I cannot find a way to ensure that the tokens/chunks listed in the dictionary are NOT split.


Re^3: How to tokenize string by custom dictionary? (+code)
by LanX (Saint) on Nov 06, 2013 at 18:02 UTC

    1. You have to escape them; use quotemeta or \Q respectively.

    2. You can either increase the buffer size for the trie optimization (see ${^RE_TRIE_MAXBUF} in perlvar) or match against the names in chunks (10,000 or so at a time). I recommend a combination of both (see the sketch after point 3).

    3. > we cannot distinguish ambiguous names

      Sorry, I don't understand.
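
    A rough sketch combining 1. and 2.; @names, $input, the buffer size and the chunk size are all placeholders:

        use strict;
        use warnings;

        # Give the trie optimization more memory than the default 64 KB.
        ${^RE_TRIE_MAXBUF} = 10 * 1024 * 1024;

        # Stand-ins for the real ~214k-name dictionary and the real input.
        my @names = ('ke$ha', 'd.b.cooper', 'Tim Turner (I)', 'Alex Fong');
        my $input = 'yesterday ke$ha met d.b.cooper and Alex Fong';

        my $chunk_size = 10_000;
        my @matches;
        while (my @chunk = splice @names, 0, $chunk_size) {
            # quotemeta neutralizes metacharacters like '.', '$' and '('.
            my $regex = join '|', map { quotemeta } @chunk;
            push @matches, $input =~ /($regex)/g;
        }

        print "$_\n" for @matches;

    Note that the matches come out grouped per chunk, not in the order they appear in $input.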

    Cheers Rolf

    ( addicted to the Perl Programming Language)

      Thank you very much, @Cheers Rolf. It works.

      Come to Beijing, just call me, beers waiting for you. ^_^

        > Come to Beijing, just call me, beers waiting for you. ^_^

        sure, just message me your phone number ... ;-)

        Cheers Rolf

        ( addicted to the Perl Programming Language)