in reply to Re^5: Best Way to Get Length of UTF-8 String in Bytes?
in thread Best Way to Get Length of UTF-8 String in Bytes?

I never even vaguely implied that both were or were not fixed width.

The only pronoun substitution that makes sense to me is "No character data type of fixed width should ever be smaller than the number of bits needed to store any and all possible Unicode code points also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price."

Feel free to clarify if you meant something else, but it's probably unimportant.

Not dependent on endianness.

If you use UTF-16be, UTF-8 or iso-8859-1 when you should be using UTF-16le, it won't work. That has nothing to do with endianness; it's simply the wrong encoding, and the same applies equally to UTF-8, UTF-16le and UTF-16be.
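A quick sketch of that point, in Python rather than Perl purely for illustration. The sample string and the `try_decode` helper are my own assumptions, not anything from the thread. It writes bytes as UTF-16le and then reads them back with each of the candidate encodings.

```python
# Sample text is an assumption chosen for illustration.
text = "héllo"
data = text.encode("utf-16-le")  # the bytes the producer actually wrote


def try_decode(raw, encoding):
    """Return the decoded string, or None if raw is invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Read the same bytes back with the right encoding and three wrong ones.
results = {
    enc: try_decode(data, enc)
    for enc in ("utf-16-le", "utf-16-be", "utf-8", "iso-8859-1")
}

for enc, decoded in results.items():
    status = "ok" if decoded == text else "wrong"
    print(f"{enc:12} {status:6} {decoded!r}")
```

Only the matching encoding round-trips. The other three fail in different ways (garbage characters or a decode error), and only one of those three differs from the right answer by byte order.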

Compat with ASCII. Smaller. Works with the C syscall interface without retooling.

Both of these have been, and continue to be, an endless source of bugs. I don't consider that an advantage.

Sorts right numerically.

I don't know what you mean. You can't sort the bytes of UTF-8, UTF-16le or UTF-16be, so you must mean code points, but code points aren't encoding-specific.

Never uses NULL bytes unless it means them to be NULL bytes.

More generally, UTF-16 makes it harder to use sentinel values. (You have to know whether you're at an odd offset or not, and the sentinel has to be two bytes long.) Granted.
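To make the odd-offset problem concrete, here is a minimal sketch in Python (again only for illustration). The sample strings and the `find_aligned_terminator` helper are hypothetical names of my own.

```python
# Plain ASCII text: UTF-8 has no zero bytes, UTF-16 has one per character.
ascii_text = "abc"
utf8 = ascii_text.encode("utf-8")        # b'abc'
utf16le = ascii_text.encode("utf-16-le") # b'a\x00b\x00c\x00'
utf16be = ascii_text.encode("utf-16-be") # b'\x00a\x00b\x00c'

# "a" followed by U+0100, then a two-byte terminator appended by hand.
# In UTF-16le this is 61 00 | 00 01 | 00 00, so two zero bytes also
# appear at odd offset 1, straddling two code units.
raw = "a\u0100".encode("utf-16-le") + b"\x00\x00"


def find_aligned_terminator(buf):
    """Return the offset of the first 00 00 pair on a 2-byte boundary, or -1."""
    for i in range(0, len(buf) - 1, 2):
        if buf[i:i + 2] == b"\x00\x00":
            return i
    return -1


naive = raw.find(b"\x00\x00")           # ignores alignment
aligned = find_aligned_terminator(raw)  # respects code unit boundaries
print(naive, aligned)
```

The naive search stops in the middle of the data; only the aligned search finds the real terminator.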

Never lets lazy programmers trick themselves into thinking it is fixed width.

Well, I don't think UTF-8 is much better there. Sure, there's no misconception about code points being fixed width when using UTF-8, but that's not worth much as long as graphemes are commonly thought to be fixed width.
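A small sketch of the grapheme point, in Python for illustration only. The choice of "é" as the example is my own assumption. The same visible character can be one code point or two, and its size differs again in each encoding.

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

# Both render as one grapheme, yet every count below differs.
counts = {
    name: {
        "code_points": len(s),
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_bytes": len(s.encode("utf-16-le")),
    }
    for name, s in (("composed", composed), ("decomposed", decomposed))
}

# Normalizing to NFC shows the two spellings are canonically equivalent.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

for name, c in counts.items():
    print(name, c)
print("equal after NFC:", same_after_nfc)
```

Neither encoding tells you how many graphemes you have; for that you need proper grapheme cluster segmentation, which this sketch deliberately does not attempt.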
