in reply to The “real length" of UTF8 strings

As a first approximation you can loop over the characters and add 0 for combining characters, 1 for "normal" ones, and 2 for characters in the Han script.
    sub visual_length {
        my $s = shift;
        my $l = 0;
        while ($s =~ m/(.)/g) {
            my $c = $1;
            if ($c =~ m/\p{M}/) {
                # do nothing
            } elsif ($c =~ m/\p{Han}/) {
                $l += 2;
            } else {
                $l++;
            }
        }
        return $l;
    }

That could use much more tweaking, but maybe it's a start for you.
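
One possible tweak, as a rough sketch only: Han is not the only script that is usually drawn double-width (kana and Hangul are too), so instead of \p{Han} one could test the East_Asian_Width property. The name visual_width is made up for this example, and treating only the Wide and Fullwidth classes as two columns is an assumption; terminals and fonts may disagree.

```perl
use strict;
use warnings;

# Hypothetical variant of visual_length: combining marks count 0,
# East Asian Wide/Fullwidth characters count 2, everything else 1.
sub visual_width {
    my ($s) = @_;
    my $w = 0;
    for my $c (split //, $s) {
        next if $c =~ /\p{M}/;    # combining mark: adds no width
        $w += $c =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/ ? 2 : 1;
    }
    return $w;
}

# Strings are written with \x{...} escapes so the source file stays ASCII.
print visual_width("\x{65E5}\x{672C}"), "\n";   # two Han characters
print visual_width("e\x{301}"), "\n";           # "e" plus combining acute accent
print visual_width("\x{30AB}"), "\n";           # one katakana character
```

The input has to be a decoded character string (not raw UTF-8 bytes), otherwise each byte is counted on its own.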

Re^2: The “real length" of UTF8 strings
by Anonymous Monk on Sep 24, 2008 at 04:32 UTC

    Sure, but the Han script is probably about 40000 characters big: no way to write a list by hand. I'll try to get more info about its UTF8 code range and if "one char visual length" and "two chars visual length" are not mixed together, that should be good :)

      Sure, but the Han script is probably about 40000 characters big: no way to write a list by hand.

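      That's why my example queries each character for the Unicode property \p{Han}, i.e. whether the character belongs to the Han script. Perl looks that up in its own Unicode tables, so you never have to write the list yourself.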
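
      To make the point concrete, here is a small sketch of a per-character property check. The helper name is_han is invented for this example; the only thing it relies on is the \p{Han} property itself.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the single character belongs to
# the Han script, 0 otherwise. No hand-written character list needed.
sub is_han {
    my ($c) = @_;
    return $c =~ /\A\p{Han}\z/ ? 1 : 0;
}

for my $c ("\x{4E2D}", "\x{30AB}", "a") {   # a CJK ideograph, a katakana, a Latin letter
    my $label = is_han($c) ? "Han" : "not Han";
    printf "U+%04X: %s\n", ord($c), $label;
}
```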

      For a better description of Unicode properties, scripts and blocks in regexes I recommend "Mastering Regular Expressions" by Jeffrey Friedl, pp. 121ff.