in reply to Re^3: Best Way to Get Length of UTF-8 String in Bytes?
in thread Best Way to Get Length of UTF-8 String in Bytes?

It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price

UCS-2 isn't variable width, so I think it was an error to mention it.

UTF‑16 has all the disadvantages of UTF‑8 but none of its advantages.

What advantage does UTF-8 have over UTF-16?

I can only think of one advantage UTF-16 has that UTF-8 doesn't: it's not mistakable for iso-8859-*.
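
Roughly what I mean, sketched in Perl with the core Encode module (the byte values are just for illustration):

    use strict;
    use warnings;
    use Encode qw(encode decode);

    # The UTF-8 bytes for U+00E9 ("é") are also perfectly valid iso-8859-1 text,
    # so a wrong guess decodes "successfully" into mojibake.
    my $utf8_bytes = encode('UTF-8', "\x{E9}");          # "\xC3\xA9"
    my $mistaken   = decode('iso-8859-1', $utf8_bytes);  # no error, but yields "Ã©"

    # The UTF-16LE bytes contain a NUL, which iso-8859-* text essentially never
    # does, so that particular confusion is far less likely.
    my $utf16_bytes = encode('UTF-16LE', "\x{E9}");      # "\xE9\x00"
    printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $utf16_bytes;  # E9 00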


Replies are listed 'Best First'.
Re^5: Best Way to Get Length of UTF-8 String in Bytes?
by Anonymous Monk on Apr 25, 2011 at 06:40 UTC
    I can only think of one advantage UTF-16 has that UTF-8 doesn't: it's not mistakable for iso-8859-*.

    IIRC, UTF-8 with BOM is unmistakable for iso-8859-*. :)
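
    For example (just a sketch; $data here is a made-up input):

        use strict;
        use warnings;
        use Encode qw(encode);

        # The UTF-8 BOM is simply U+FEFF encoded as UTF-8: the bytes EF BB BF.
        my $bom = encode('UTF-8', "\x{FEFF}");
        printf "%s\n", join ' ', map { sprintf '%02X', ord } split //, $bom;  # EF BB BF

        # A sniffer can check for those bytes before guessing iso-8859-*:
        my $data = "\xEF\xBB\xBF" . "hello";              # hypothetical input
        my $has_utf8_bom = substr($data, 0, 3) eq $bom;   # true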

      True, but very few database fields, HTML element contents, strings, etc. start with a BOM. In fact, it wouldn't even be appropriate for them to start with a BOM.
Re^5: Best Way to Get Length of UTF-8 String in Bytes?
by tchrist (Pilgrim) on Apr 27, 2011 at 05:30 UTC
    I mentioned UCS-2 because it, like UTF-16, is a mess. Except that it’s worse. Anyway, it wasn’t a mistake to mention them together, and I never even vaguely implied that both were or were not fixed width. I said people who jumped on either of those have paid for doing so, because both are problematic. Plus, of course, UCS-2 can't encode most of Unicode (anything outside the BMP).

    As for the many many advantages of UTF-8 over UTF-16, I can hardly begin to list them all. Sorts right numerically. Not dependent on endianness. Compat with ASCII. Smaller. Works with the C syscall interface without retooling. Never uses NULL bytes unless it means them to be NULL bytes. Never lets lazy programmers trick themselves into thinking it is fixed width. Is that enough for you?

    The Wikipedia page has a lot of them.
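
    A quick sketch of the sort-order point (illustrative only; the characters are arbitrarily chosen):

        use strict;
        use warnings;
        use Encode qw(encode);

        # Comparing UTF-8 strings byte by byte gives the same order as comparing
        # code points. UTF-16LE does not: the low byte comes first, and anything
        # outside the BMP is stored as a surrogate pair starting at 0xD8xx.
        my @chars = ("A", "\x{FF}", "\x{100}", "\x{10000}");

        my @by_codepoint = sort { ord($a) <=> ord($b) } @chars;
        my @by_utf8      = sort { encode('UTF-8',    $a) cmp encode('UTF-8',    $b) } @chars;
        my @by_utf16le   = sort { encode('UTF-16LE', $a) cmp encode('UTF-16LE', $b) } @chars;

        printf "code points: %s\n", join ' ', map { sprintf 'U+%04X', ord } @by_codepoint;
        # U+0041 U+00FF U+0100 U+10000
        printf "UTF-8 bytes: %s\n", join ' ', map { sprintf 'U+%04X', ord } @by_utf8;
        # U+0041 U+00FF U+0100 U+10000  -- identical to code-point order
        printf "UTF-16LE   : %s\n", join ' ', map { sprintf 'U+%04X', ord } @by_utf16le;
        # U+0100 U+10000 U+0041 U+00FF  -- scrambled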

      I never even vaguely implied that both were or were not fixed width.

      The only pronoun substitution that makes sense to me is "No character data type of fixed width should ever be smaller than the number of bits needed to store any and all possible Unicode code points also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price."

      Feel free to clarify if you meant something else, but it's probably unimportant.

      Not dependent on endianness.

      If you use UTF-16be, UTF-8 or iso-8859-1 when you should be using UTF-16le, it won't work, but that's a matter of picking the wrong encoding, not of endianness, and it applies just as much to UTF-8, UTF-16le and UTF-16be.

      Compat with ASCII. Smaller. Works with the C syscall interface without retooling.

      These have been, and continue to be, an endless source of bugs. I don't consider that an advantage.

      Sorts right numerically.

      I don't know what you mean. You can't sort the bytes of UTF-8, UTF-16le or UTF-16be, so you must mean code points, but code points aren't encoding-specific.

      Never uses NULL bytes unless it means them to be NULL bytes.

      More generally, it makes it harder to use sentinel values. (You have to know whether you're at an odd byte offset or not, and the sentinel has to be two bytes long.) Granted.
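
      For the record, the NUL situation looks like this (a quick sketch):

          use strict;
          use warnings;
          use Encode qw(encode);

          # Plain ASCII picks up NUL bytes once it's UTF-16 encoded, which is what
          # breaks NUL-terminated C-style interfaces and one-byte sentinels.
          my $utf16 = encode('UTF-16LE', 'abc');   # "a\0b\0c\0"
          my $utf8  = encode('UTF-8',    'abc');   # "abc"
          printf "NUL bytes in UTF-16LE: %d\n", $utf16 =~ tr/\0//;   # 3
          printf "NUL bytes in UTF-8:    %d\n", $utf8  =~ tr/\0//;   # 0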

      Never lets lazy programmers trick themselves into thinking it is fixed width.

      Well, I don't think UTF-8 is much better there. Sure, there's no misconception about code points being fixed-width when using UTF-8, but that's not worth much as long as graphemes are commonly thought to be fixed-width.
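
      A tiny sketch of why neither encoding saves the lazy programmer (one user-perceived character, counted three ways):

          use strict;
          use warnings;
          use Encode qw(encode);

          # "é" written as "e" + COMBINING ACUTE ACCENT: one grapheme,
          # two code points, three UTF-8 bytes, four UTF-16LE bytes.
          my $str = "e\x{0301}";
          my $graphemes = () = $str =~ /\X/g;
          printf "graphemes:      %d\n", $graphemes;                          # 1
          printf "code points:    %d\n", length $str;                         # 2
          printf "UTF-8 bytes:    %d\n", length(encode('UTF-8', $str));       # 3
          printf "UTF-16LE bytes: %d\n", length(encode('UTF-16LE', $str));    # 4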