
Sure, but the Han script is probably about 40000 characters big: no way to write a list by hand. I'll try to find out more about its Unicode code point ranges; if the characters that are one column wide and those that are two columns wide are not mixed together in those ranges, that should work :)

Re^3: The “real length" of UTF8 strings
by moritz (Cardinal) on Sep 24, 2008 at 07:57 UTC
    Sure, but the Han script is probably about 40000 characters big: no way to write a list by hand.

    That's why my example queries each character for the Unicode property \p{Han}, i.e. whether the character belongs to the Han script.
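
The property-based approach could be sketched roughly as follows. This is not the example from the earlier post, only a minimal illustration: the helper name visual_width is hypothetical, and the sketch assumes that every Han character takes two terminal columns while every other character takes one. Other wide characters (kana, Hangul, fullwidth forms) and combining marks are not handled.

```perl
use strict;
use warnings;

# visual_width() is a hypothetical helper, not code from the thread.
# Assumption: each Han character occupies two columns, everything else one.
sub visual_width {
    my ($str) = @_;

    # Count the characters that carry the Unicode script property Han.
    my $han = () = $str =~ /\p{Han}/g;

    # length() counts characters; each Han character adds one extra column.
    return length($str) + $han;
}

print visual_width("abc"), "\n";                 # prints 3
print visual_width("\x{6F22}\x{5B57}"), "\n";    # two Han characters, prints 4
print visual_width("a\x{6F22}b"), "\n";          # prints 4
```

The input has to be a decoded character string (not raw UTF-8 bytes), otherwise neither length() nor \p{Han} sees whole characters.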

    For a better description of Unicode properties, scripts and blocks in regexes I recommend "Mastering Regular Expressions" by Jeffrey Friedl, pages 121ff.