Re: The “real length" of UTF8 strings

As a first approximation, you can loop over the characters and add 0 for combining characters, 1 for "normal" ones, and 2 for characters in the Han script.

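sub visual_length {
    my $s = shift;    # expects a decoded character string, not raw UTF-8 bytes
    my $l = 0;
    while ($s =~ m/(.)/g){
        my $c = $1;
        if ($c =~ m/\p{M}/){
            # combining mark: drawn on the previous character, adds no width
        } elsif ($c =~ m/\p{Han}/) {
            $l += 2;    # Han characters count as two columns
        } else {
            $l++;       # everything else counts as one column
        }
    }
    return $l;
}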

That could use much more tweaking, but maybe it's a start for you.
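To make the counting rule concrete, here is a small self-contained sketch that repeats the sub and runs it on a few sample strings. The sample strings are my own choice for illustration, and the sketch assumes the input has already been decoded into Perl characters (for example with `decode` from the core `Encode` module); on raw UTF-8 bytes the properties are tested against individual bytes and the count comes out wrong.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Same rule as above, repeated so that this snippet runs on its own.
sub visual_length {
    my $s = shift;
    my $l = 0;
    while ($s =~ m/(.)/g) {
        my $c = $1;
        next if $c =~ m/\p{M}/;             # combining mark: adds no width
        $l += $c =~ m/\p{Han}/ ? 2 : 1;     # Han: two columns, otherwise one
    }
    return $l;
}

# Illustrative inputs (not from the original post).
my $ascii     = "abc";
my $combining = "e\x{301}";                  # "e" + COMBINING ACUTE ACCENT
my $han       = "\x{6F22}\x{5B57}";          # two Han characters
my $bytes     = "\xE6\xBC\xA2\xE5\xAD\x97";  # the same two characters as UTF-8 bytes

printf "%d\n", visual_length($ascii);                    # 3
printf "%d\n", visual_length($combining);                # 1
printf "%d\n", visual_length($han);                      # 4
printf "%d\n", visual_length(decode('UTF-8', $bytes));   # 4, after decoding
```

If the strings come from a file or a socket, decoding them first (or opening the handle with an `:encoding(UTF-8)` layer) is what makes the property tests see whole characters.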


Re^2: The “real length" of UTF8 strings by Anonymous Monk on Sep 24, 2008 at 04:32 UTC
Sure, but the Han script is probably about 40000 characters big: no way to write a list by hand. I'll try to get more info about its UTF-8 code range, and if the "one char visual length" and "two chars visual length" characters are not mixed together, that should be good :)
Re^3: The “real length" of UTF8 strings by moritz (Cardinal) on Sep 24, 2008 at 07:57 UTC
"Sure, but the Han script is probably about 40000 characters big: no way to write a list by hand."

That's why my example queries each character for the Unicode property `\p{Han}`, i.e. whether the character belongs to that script. For a better description of Unicode properties, scripts and blocks in regexes I recommend "Mastering Regular Expressions" by Jeffrey Friedl, pages 121ff.
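As a rough illustration of why no hand-written list or code range is needed, the sketch below asks the regex engine about a handful of individual characters. The helper name `is_han` and the sample code points are made up for this example; only `\p{Han}` itself comes from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the single character belongs to the Han script.
sub is_han {
    my $c = shift;
    return $c =~ m/\A\p{Han}\z/ ? 1 : 0;
}

# Sample code points chosen for illustration.
my @samples = (
    [ "\x{6F22}",  'U+6F22  (CJK Unified Ideographs)' ],
    [ "\x{3400}",  'U+3400  (CJK Extension A)' ],
    [ "\x{20000}", 'U+20000 (CJK Extension B)' ],
    [ "\x{3042}",  'U+3042  (Hiragana letter A)' ],
    [ "A",         'U+0041  (Latin capital A)' ],
);

for my $sample (@samples) {
    my ($char, $label) = @$sample;
    printf "%-36s %s\n", $label, is_han($char) ? 'Han' : 'not Han';
}
```

The first three samples sit in different parts of the code space and all match, so the property lookup saves you from tracking several ranges yourself. The Hiragana sample does not match even though such characters are usually drawn double width as well, which is one of the places where the first approximation would need tweaking.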