Re^4: length() miscounting UTF8 characters?


by ikegami (Patriarch)

on Apr 30, 2014 at 18:38 UTC

in reply to Re^3: length() miscounting UTF8 characters?
in thread length() miscounting UTF8 characters?

"The problems with length are not around bytes vs. characters, but that length counts code points. Many logical characters are composed from multiple code points."

1. What you call a "logical character" is an "extended grapheme cluster", which I abbreviate to "grapheme".

2. length doesn't count code points. length always counts characters (string elements). It has no idea what those characters represent, as that information is neither available nor needed. To length, they are just 32-bit or 64-bit numbers. They could be bytes. They could be Unicode code points. But they aren't going to be graphemes (visual characters), as there is no existing system to encode a grapheme in a single number.
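
To make the distinction concrete, here is a minimal sketch (my own illustration, not code from the thread). It assumes the input is the UTF-8 encoding of "e" followed by U+0301 COMBINING ACUTE ACCENT, and it uses the core Encode module to decode. The variable names are made up for the example. The same data gives three different counts depending on what the string elements are:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample input: "e" + U+0301 COMBINING ACUTE ACCENT,
# written out as its three UTF-8 encoded bytes.
my $bytes = "e\xCC\x81";

# Undecoded string: each string element is a byte.
my $n_bytes = length($bytes);

# Decoded string: each string element is a Unicode code point.
my $text = decode('UTF-8', $bytes);
my $n_code_points = length($text);

# length has no notion of graphemes; to count them, count the
# matches of \X (one extended grapheme cluster per match).
my $n_graphemes = () = $text =~ /\X/g;

print "$n_bytes $n_code_points $n_graphemes\n";   # prints "3 2 1"
```

In both calls length does the same thing: it reports how many elements the string holds. Only the meaning of those elements changed when the string was decoded. The grapheme count has to come from the regex engine's \X, not from length.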
