PerlMonks  

Re: The “real length" of UTF8 strings

by moritz (Cardinal)
on Sep 23, 2008 at 20:30 UTC ( [id://713302] )


in reply to The “real length" of UTF8 strings

As a first approximation you can loop over the characters and add 0 for combining characters, 1 for "normal" ones and 2 for characters in the Han script.

    sub visual_length {
        my $s = shift;
        my $l = 0;
        # walk the string one character at a time
        while ($s =~ m/(.)/g) {
            my $c = $1;
            if ($c =~ m/\p{M}/) {
                # combining mark: do nothing
            } elsif ($c =~ m/\p{Han}/) {
                $l += 2;    # Han character: two columns
            } else {
                $l++;       # everything else: one column
            }
        }
        return $l;
    }

That could use much more tweaking, but maybe it's a start for you.
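
To see what that rule gives in practice, here is a minimal, self-contained sketch that repeats the sub and calls it on three sample strings. The sample strings are my own picks for illustration, and they are written with \x{...} escapes so the file does not need "use utf8".

```perl
use strict;
use warnings;

# Same counting rule as in the reply above: 0 for combining marks,
# 2 for Han characters, 1 for everything else.
sub visual_length {
    my $s = shift;
    my $l = 0;
    while ($s =~ m/(.)/g) {
        my $c = $1;
        if ($c =~ m/\p{M}/) {
            # combining mark: adds no width
        } elsif ($c =~ m/\p{Han}/) {
            $l += 2;
        } else {
            $l++;
        }
    }
    return $l;
}

# Illustrative inputs only (not from the original question).
my $plain    = "abc";
my $combined = "e\x{301}";          # "e" + COMBINING ACUTE ACCENT
my $han      = "\x{4E2D}\x{6587}";  # two Han ideographs

printf "%d %d %d\n",
    visual_length($plain), visual_length($combined), visual_length($han);
# prints "3 1 4"
```

Note that this only works on decoded character strings. If the data arrives as raw UTF-8 bytes, it has to be decoded first, for example with Encode::decode('UTF-8', $bytes).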

Replies are listed 'Best First'.
Re^2: The “real length" of UTF8 strings
by Anonymous Monk on Sep 24, 2008 at 04:32 UTC

    Sure, but the Han script is probably about 40000 characters big: no way to write a list by hand. I'll try to get more info about its UTF8 code range and if "one char visual length" and "two chars visual length" are not mixed together, that should be good :)

      Sure, but the Han script is probably about 40000 characters big: no way to write a list by hand.

      That's why my example queries each character for the Unicode property \p{Han}, i.e. it asks whether the character belongs to the Han script, so no hand-written list is needed.

      For a better description of Unicode properties, scripts and blocks in regexes I recommend "Mastering Regular Expressions" by Jeffrey Friedl, pages 121ff.
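
      As a rough illustration of the property lookup, the sketch below classifies a handful of code points with the same two properties. The helper name classify and the sample code points are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: report which of the two properties a code point has.
sub classify {
    my $c = chr shift;
    return $c =~ /\p{Han}/ ? "Han"
         : $c =~ /\p{M}/   ? "combining mark"
         :                   "other";
}

# Sample code points chosen only for illustration:
# LATIN CAPITAL A, COMBINING ACUTE ACCENT, a Han ideograph,
# HIRAGANA A, FULLWIDTH LATIN CAPITAL A.
for my $cp (0x0041, 0x0301, 0x4E2D, 0x3042, 0xFF21) {
    printf "U+%04X  %s\n", $cp, classify($cp);
}
```

      The last two samples come out as "other", although kana and fullwidth forms usually take two columns in a terminal as well. That is one of the places where the approximation needs tweaking. Perl's perluniprops page lists an East_Asian_Width property that may be a better fit, but check whether your Perl version supports it.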
