in reply to Re^2: Best Way to Get Length of UTF-8 String in Bytes?
in thread Best Way to Get Length of UTF-8 String in Bytes?

It is an intentional property of the UTF-8 encoding that, although it is variable-length, you can easily tell when you're in the middle of a character and where whole characters begin. Continuation bytes always have the form 10xxxxxx. Single-byte characters always have a high bit of 0 (0xxxxxxx), and multi-byte characters always start with a byte that has as many leading 1 bits as there are bytes in total: 110xxxxx for two bytes, 1110xxxx for three bytes, 11110xxx for four.
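As a rough illustration of those bit patterns, here is a minimal Python sketch that classifies a single byte. The function name `utf8_byte_kind` and the labels it returns are made up for this example, not part of any library.

```python
def utf8_byte_kind(b: int) -> str:
    """Classify one byte of a UTF-8 stream by its leading bits.

    Hypothetical helper for illustration; the label strings are arbitrary.
    """
    if (b & 0x80) == 0x00:   # 0xxxxxxx
        return "single"
    if (b & 0xC0) == 0x80:   # 10xxxxxx
        return "continuation"
    if (b & 0xE0) == 0xC0:   # 110xxxxx
        return "lead2"
    if (b & 0xF0) == 0xE0:   # 1110xxxx
        return "lead3"
    if (b & 0xF8) == 0xF0:   # 11110xxx
        return "lead4"
    return "invalid"         # 11111xxx never appears in UTF-8


# "a" is one byte, "é" is two, "€" is three.
kinds = [utf8_byte_kind(b) for b in "aé€".encode("utf-8")]
# kinds == ["single", "lead2", "continuation",
#           "lead3", "continuation", "continuation"]
```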

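So, take the UTF-8 encoded byte string and set N to the maximum length in bytes. While the byte at (zero-based) position N is a continuation byte, decrement N. Now you can truncate to length N without splitting a character. (If the string is no longer than the maximum to begin with, there is nothing to do.)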
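The back-up loop might look something like this in Python. It is only a sketch: `truncate_utf8` is a name invented here, and it assumes the input is valid UTF-8 and that `max_bytes` is not negative.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut `data` to at most `max_bytes` bytes without splitting a character.

    Hypothetical helper; assumes valid UTF-8 and max_bytes >= 0.
    """
    if len(data) <= max_bytes:
        return data  # already short enough
    n = max_bytes
    # data[n] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), a cut here lands mid-character,
    # so back up until data[n] starts a character.
    while n > 0 and (data[n] & 0xC0) == 0x80:
        n -= 1
    return data[:n]


raw = "héllo".encode("utf-8")   # 6 bytes: "é" takes two
short = truncate_utf8(raw, 2)   # b"h", since a cut at 2 would split "é"
```

Note that the result can be shorter than `max_bytes`, and can even be empty when the very first character is longer than the limit.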

To prevent clipping the accents off a base character or something like that, you can furthermore look at the whole character beginning at N. Check its Unicode properties to see if it's a modifier (a combining mark, for instance). If it is, decrement N again, back up to the start of that character as before, and repeat.
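Here is one way that extra check could be sketched in Python, using the standard `unicodedata` module. Treating "modifier" as "general category starting with M" (combining marks) is an assumption of this example, as is the function name `truncate_utf8_keep_marks`. The input is again assumed to be valid UTF-8, and this does not implement full grapheme cluster segmentation.

```python
import unicodedata


def truncate_utf8_keep_marks(data: bytes, max_bytes: int) -> bytes:
    """Like a plain boundary-safe cut, but never cut just before a mark.

    Hypothetical helper; "mark" here means general category M*
    (an assumption). Assumes valid UTF-8 and max_bytes >= 0.
    """
    if len(data) <= max_bytes:
        return data
    n = max_bytes
    while True:
        # Back up to the start of the character containing position n.
        while n > 0 and (data[n] & 0xC0) == 0x80:
            n -= 1
        if n == 0:
            break
        # Find the end of that character and decode it.
        m = n + 1
        while m < len(data) and (data[m] & 0xC0) == 0x80:
            m += 1
        ch = data[n:m].decode("utf-8")
        if not unicodedata.category(ch).startswith("M"):
            break  # the dropped part starts with a base character
        n -= 1     # the cut would orphan a mark: step back and retry
    return data[:n]


raw = "e\u0301x".encode("utf-8")        # "e" + COMBINING ACUTE ACCENT + "x"
kept = truncate_utf8_keep_marks(raw, 1)  # b"": keeping "e" alone would drop its accent
```

The design choice is to drop the base character along with its marks rather than keep a bare "e", which matches the goal of not clipping accents.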
