in reply to Re^2: Best Way to Get Length of UTF-8 String in Bytes?
in thread Best Way to Get Length of UTF-8 String in Bytes?

But then, this is just another in a long list of reasons why the whole Unicode thing should be dumped in favour of a standard that throws away all the accumulated cruft of transitional standards and yesteryear's physical and financial restrictions.

With RAM as cheap as it is today, variable-length encodings make no sense given the restrictions and overheads they impose. And any 'standard' under which it is impossible to tell what a piece of data actually represents without reference to some external metadata is equally nonsensical.

With luck, the current mess will be consigned to the bitbucket of history along with all the other evolutionary dead ends like 6-bit bytes and 36-bit words.

Yes and no.

The no parts are that you seem to have confused UTF‑8 with Unicode. Unicode is here to stay, and does not share in UTF‑8’s flaws. But realistically, you are simply never going to get rid of UTF‑8 as a transfer format. Do you truly think that people are going to put up with the near-quadrupling in space that the gigabytes and gigabytes of large corpora would require if they were stored or transferred as UTF‑32? That will never happen.
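The size difference is easy to see with Perl's core Encode module (a minimal sketch; the sample string is only illustrative of the ASCII-dominated text that makes up most corpora):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "Hello, world";                      # mostly-ASCII text, as in most corpora

my $utf8  = encode('UTF-8',    $text);
my $utf32 = encode('UTF-32LE', $text);

printf "UTF-8:  %d bytes\n", length $utf8;      # 12 bytes
printf "UTF-32: %d bytes\n", length $utf32;     # 48 bytes
```

For ASCII-range text, UTF-32 really is four times the size of UTF-8.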

The yes part is that I agree that int is the new char. No character data type of fixed width should ever be smaller than the number of bits needed to store any and all possible Unicode code points. Because Unicode is a 21‑bit charset, that means you need 32‑bit characters.

It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price, since UTF‑16 has all the disadvantages of UTF‑8 but none of its advantages.

At least Perl didn’t make that particular brain-damaged mistake! It could have been much worse. UTF‑8 is now the de facto standard, and I am very glad that Perl didn’t do the stupid thing that Java and so many others did: just try matching non-BMP code points in character classes, for example. Can’t do it in the UTF-16 languages. Oops! :(
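For instance, a Perl character class can span non-BMP code points directly, because Perl strings are sequences of code points rather than UTF-16 code units (a small sketch; the particular code point is just an example):

```perl
use strict;
use warnings;

my $clef = "\x{1D11E}";    # MUSICAL SYMBOL G CLEF, a non-BMP code point

# The class covers the Musical Symbols block, U+1D100..U+1D1FF
my $matched = $clef =~ /\A[\x{1D100}-\x{1D1FF}]\z/ ? 1 : 0;

print "matched: $matched\n";    # matched: 1
```

In a UTF-16 language the same code point is two code units (a surrogate pair), which is exactly what breaks character classes there.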

Replies are listed 'Best First'.
Re^4: Best Way to Get Length of UTF-8 String in Bytes?
by BrowserUk (Patriarch) on Apr 24, 2011 at 19:23 UTC
    Do you truly think that people are going to put up with the near-quadrupling in space that the gigabytes and gigabytes of large corpora would require if they were stored or transferred as UTF‑32?

    I think that this 'space' argument is a complete crock.

    Firstly, if saving disk space is the primary criterion--even though disk space is cheaper today than ever before--then gzipping or even bzip2ing is far, far more effective than variable-length characters. Even if you expand UTF-8 to UTF-32 prior to gzipping, the resultant file size is hardly affected.
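    That claim is easy to test (a sketch using the core Encode and IO::Compress::Gzip modules; the repetitive sample text is only a stand-in for a real corpus, so exact sizes will vary with the data):

```perl
use strict;
use warnings;
use Encode qw(encode);
use IO::Compress::Gzip qw(gzip $GzipError);

my $text = "the quick brown fox jumps over the lazy dog\n" x 5_000;

my $utf8  = encode('UTF-8',    $text);
my $utf32 = encode('UTF-32LE', $text);   # four times the raw size for ASCII text

gzip \$utf8  => \my $gz8  or die $GzipError;
gzip \$utf32 => \my $gz32 or die $GzipError;

printf "UTF-8:  %7d bytes raw, %6d gzipped\n", length $utf8,  length $gz8;
printf "UTF-32: %7d bytes raw, %6d gzipped\n", length $utf32, length $gz32;
```

    Both encodings compress to a tiny fraction of their raw size; the gap between the two compressed forms is far smaller than the 4x gap between the raw forms.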

    Secondly, if saving RAM is the criterion, then load the data zipped and expand it in memory. Disk is slow and RAM is fast. The cost of unpacking on demand is offset, almost if not entirely, by the time saved reading smaller volumes from disk.
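    In Perl that round trip is a few lines (a sketch with the core IO::Compress modules; the in-memory string stands in for a gzipped file on disk):

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $original = "sample corpus line\n" x 10_000;         # stand-in for the raw data

gzip   \$original => \my $packed   or die $GzipError;   # the small form disk I/O reads
gunzip \$packed   => \my $expanded or die $GunzipError; # expanded on demand in RAM

printf "packed: %d bytes, expanded: %d bytes\n", length $packed, length $expanded;
```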

    Finally, if the 'new' encoding scheme addressed the problem of having the data itself identify what it is--through (say) an expanded BOM mechanism or similar--then there would be no need to convert legacy data.

    It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price

    Maybe so. But then, when the UCS-2 scheme was adopted, the full Unicode standard was nothing more than a twinkle in the eyes of its beholders. And those that adopted it had 10 or so years of workable solution before Unicode got its act together. (In so far as it has:)

    I remember the Byte magazine article entitled something like "Universal Character Set versus Unicode", circa 1993. From memory, it came down pretty clearly on the side of UCS at the time. UCS may not be the best solution now, but for the best part of 15 years it offered a solution that Unicode has only got around to matching in the last two or three years.

    I can't address the specifics though. I rarely ever encounter anything beyond ASCII data, so I've not bothered with it.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^4: Best Way to Get Length of UTF-8 String in Bytes?
by ikegami (Patriarch) on Apr 25, 2011 at 05:05 UTC

    It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price

    UCS-2 isn't variable width, so I think it was an error to mention it.

    UTF‑16 has all the disadvantages of UTF‑8 but none of its advantages.

    What advantage does UTF-8 have over UTF-16?

    I can only think of one advantage UTF-16 has that UTF-8 doesn't: it's not mistakable for iso-8859-*.

      I can only think of one advantage UTF-16 has that UTF-8 doesn't: it's not mistakable for iso-8859-*.

      IIRC, UTF-8 with BOM is unmistakable for iso-8859-*. :)

        True, but very few database fields, HTML element contents, strings, etc. start with a BOM. In fact, it wouldn't even be appropriate for them to start with one.
      I mentioned UCS-2 because it, like UTF-16, is a mess. Except that it’s worse. Anyway, it wasn’t a mistake to mention them together, and I never even vaguely implied that both were or were not fixed width. I said people who jumped on either of those have paid for doing so, because both are problematic. Plus of course UCS-2 doesn't encode most of Unicode.

      As for the many many advantages of UTF-8 over UTF-16, I can hardly begin to list them all. Sorts right numerically. Not dependent on endianness. Compat with ASCII. Smaller. Works with the C syscall interface without retooling. Never uses NULL bytes unless it means them to be NULL bytes. Never lets lazy programmers trick themselves into thinking it is fixed width. Is that enough for you?

      The Wikipedia page has a lot of them.
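      The "sorts right" point is easy to demonstrate (a small sketch; the three sample characters are chosen only because their code points ascend through the ASCII, Latin-1 and astral ranges):

```perl
use strict;
use warnings;
use Encode qw(encode);

# code points in ascending order: U+0041, U+00E9, U+1F600
my @chars = ("\x{41}", "\x{E9}", "\x{1F600}");

my @utf8  = map { encode('UTF-8',    $_) } @chars;
my @utf16 = map { encode('UTF-16LE', $_) } @chars;

# A plain byte-wise sort keeps UTF-8 in code point order ...
my $utf8_ok  = join('', sort @utf8)  eq join('', @utf8)  ? 1 : 0;
# ... but not UTF-16LE, whose byte-swapped units and surrogate pairs reorder
my $utf16_ok = join('', sort @utf16) eq join('', @utf16) ? 1 : 0;

print "UTF-8 preserved: $utf8_ok, UTF-16LE preserved: $utf16_ok\n";
```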

        I never even vaguely implied that both were or were not fixed width.

        The only pronoun substitution that makes sense to me is "No character data type of fixed width should ever be smaller than the number of bits needed to store any and all possible Unicode code points also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price."

        Feel free to clarify if you meant something else, but it's probably unimportant.

        Not dependent on endianness.

        If you use UTF-16be or UTF-8 or iso-8859-1 when you should be using UTF-16le, it won't work; but that's a matter of using the wrong encoding, not of endianness, and it applies just as much to UTF-8, UTF-16le and UTF-16be.

        Compat with ASCII. Smaller. Works with the C syscall interface without retooling.

        These have been, and continue to be, an endless source of bugs. I don't consider that an advantage.

        Sorts right numerically.

        I don't know what you mean. You can't sort the bytes of UTF-8, UTF-16le or UTF-16be, so you must mean code points, but code points aren't encoding-specific.

        Never uses NULL bytes unless it means them to be NULL bytes.

        More generally, it makes it harder to use sentinel values. (You have to know whether you're at an odd offset or not, and the sentinel has to be two bytes long.) Granted.

        Never lets lazy programmers trick themselves into thinking it is fixed width.

        Well, I don't think UTF-8 is much better there. Sure, there's no misconception about code points being fixed width when using UTF-8, but that's not worth much as long as graphemes are commonly thought to be fixed width.
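        The grapheme point shows up directly in Perl, where length counts code points while the \X regex escape matches a whole extended grapheme cluster (a minimal sketch):

```perl
use strict;
use warnings;

# One user-perceived character built from two code points:
my $s = "e\x{301}";   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

my $code_points = length $s;           # counts code points: 2
my @clusters    = $s =~ /\X/g;         # \X matches one extended grapheme cluster
my $graphemes   = scalar @clusters;    # 1

print "code points: $code_points, graphemes: $graphemes\n";
```

        So even in UTF-32, "fixed width" holds only for code points, not for what users see as characters.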