in reply to Re: Best Way to Get Length of UTF-8 String in Bytes?
in thread Best Way to Get Length of UTF-8 String in Bytes?

Whatever in the world do you want it for, anyway?

Obtaining knowledge of the storage requirements for a piece of data does not seem such an unusual requirement to me. Whether it is for sizing a buffer for interfacing to a C (or other language) library; or for length-prefixing a packet for a transmission protocol; or for indexing a file; or any of a dozen other legitimate uses.

Indeed, given that this information is readily & trivially available to Perl:

#! perl -slw
use strict;
use Devel::Peek;
use charnames qw( :full );

my $text = "\N{LATIN SMALL LETTER E WITH ACUTE}";
print length $text;
Dump $text;
__END__

C:\test>junk.pl
1
SV = PVMG(0x27fea8) at 0x237ab8
  REFCNT = 1
  FLAGS = (PADMY,SMG,POK,pPOK,UTF8)
  IV = 0
  NV = 0
  PV = 0x3c78e8 "\303\251"\0 [UTF8 "\x{e9}"]
  CUR = 2
  LEN = 8
  MAGIC = 0x23c648
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 1

the absence of a simple built-in mechanism for obtaining it seems both remiss and arbitrary.
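For the record, the usual workaround is short, if not built-in; a minimal sketch, assuming the core Encode module (standard since perl 5.8):

```perl
use strict;
use warnings;
use Encode qw( encode_utf8 );

my $text = "\x{20ac}";    # EURO SIGN: one character

# length() on the character string counts characters;
# encoding to UTF-8 octets first lets length() count bytes.
my $chars = length $text;                      # 1
my $bytes = length( encode_utf8( $text ) );    # 3

print "$chars chars, $bytes bytes\n";
```

Counting the octets of an explicitly encoded copy avoids relying on the string's internal representation, which (as the Devel::Peek dump above shows) is an implementation detail.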

But then, this is just another in a long list of reasons why the whole Unicode thing should be dumped in favour of a standard that throws away all the accumulated cruft of transitional standards and yesteryear's physical and financial restrictions.

Given the cheapness of today's RAM, variable-length encodings make no sense for the restrictions and overheads they impose. And any 'standard' under which it is impossible to tell what a piece of data actually represents without reference to some external metadata is an equal nonsense.

With luck, the current mess will be consigned to the bitbucket of history along with all the other evolutionary dead ends like 6-bit bytes and 36-bit words.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^3: Best Way to Get Length of UTF-8 String in Bytes?
by tchrist (Pilgrim) on Apr 24, 2011 at 16:13 UTC
    But then, this is just another in a long list of reasons why the whole Unicode thing should be dumped in favour of a standard that throws away all the accumulated cruft of transitional standards and yesteryear's physical and financial restrictions.

    Given the cheapness of today's RAM, variable-length encodings make no sense for the restrictions and overheads they impose. And any 'standard' under which it is impossible to tell what a piece of data actually represents without reference to some external metadata is an equal nonsense.

    With luck, the current mess will be consigned to the bitbucket of history along with all the other evolutionary dead ends like 6-bit bytes and 36-bit words.

    Yes and no.

    The no parts are that you seem to have confused UTF‑8 with Unicode. Unicode is here to stay, and does not share in UTF‑8’s flaws. But realistically, you are simply never going to get rid of UTF‑8 as a transfer format. Do you truly think that people are going to put up with the near-quadrupling in space that the gigabytes and gigabytes of large corpora would require if they were stored or transferred as UTF‑32? That will never happen.
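The size penalty is easy to measure; a minimal sketch, assuming the core Encode module (the 4x figure holds exactly for pure-ASCII text, and approximately for ASCII-dominated corpora):

```perl
use strict;
use warnings;
use Encode qw( encode );

# A stand-in for an ASCII-dominated corpus.
my $corpus = "The quick brown fox jumps over the lazy dog\n" x 1000;

my $u8  = length encode( 'UTF-8',    $corpus );    # 1 byte per ASCII char
my $u32 = length encode( 'UTF-32BE', $corpus );    # 4 bytes per char, always

printf "UTF-8: %d bytes, UTF-32BE: %d bytes (%.1fx)\n",
    $u8, $u32, $u32 / $u8;
```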

    The yes part is that I agree that int is the new char. No character data type of fixed width should ever be smaller than the number of bits needed to store any and all possible Unicode code points. Because Unicode is a 21‑bit charset, that means you need 32‑bit characters.

    It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price, since UTF‑16 has all the disadvantages of UTF‑8 but none of its advantages.

    At least Perl didn’t make that particular brain-damaged mistake! It could have been much worse. UTF‑8 is now the de facto standard, and I am very glad that Perl didn’t do the stupid thing that Java and so many others did: just try matching non-BMP code points in character classes, for example. Can’t do it in the UTF-16 languages. Oops! :(
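The character-class point can be seen directly: in Perl, a class spanning supplementary-plane code points just works, because the regex engine operates on whole code points rather than UTF-16 code units. A minimal sketch:

```perl
use strict;
use warnings;

# U+1D400 MATHEMATICAL BOLD CAPITAL A lies outside the BMP.
my $char = "\x{1D400}";

# A character-class range over non-BMP code points matches as expected.
my $matched = ( $char =~ /^[\x{1D400}-\x{1D7FF}]$/ ) ? 1 : 0;

print "matched: $matched\n";
print "length: ", length( $char ), "\n";    # 1 -- one character, not a surrogate pair
```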

      Do you truly think that people are going to put up with the near-quadrupling in space that the gigabytes and gigabytes of large corpora would require if they were stored or transferred as UTF‑32?

      I think that this 'space' argument is a complete crock.

      Firstly, if saving disk space is the primary criterion--even though disk space is cheaper today than ever before--then gzipping or even bzip2ing is far, far more effective than variable-length characters. Even if you expand UTF-8 to UTF-32 prior to gzipping, the resultant file size is hardly affected.

      Secondly, if saving RAM is the criterion, then load the data zipped and expand it in memory. Disk is slow and RAM is fast. The cost of unpacking on demand is offset, almost if not entirely, by the time saved reading smaller volumes from disk.
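A rough way to test the compression claim; a sketch assuming the core Encode and IO::Compress::Gzip modules (exact sizes depend on the corpus and the compressor, so no particular ratio is asserted here):

```perl
use strict;
use warnings;
use Encode qw( encode );
use IO::Compress::Gzip qw( gzip $GzipError );

# A stand-in corpus; real corpora will compress differently.
my $corpus = "Disk is slow and RAM is fast.\n" x 1000;

my %gz;
for my $enc ( 'UTF-8', 'UTF-32BE' ) {
    my $raw = encode( $enc, $corpus );
    gzip \$raw => \my $zipped or die $GzipError;
    $gz{$enc} = length $zipped;
    printf "%-8s raw %6d bytes, gzipped %5d bytes\n",
        $enc, length $raw, $gz{$enc};
}
```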

      Finally, if the 'new' encoding scheme addressed the problem of having the data itself identify what it is--through (say) an expanded BOM mechanism or similar--then there would be no need to convert legacy data.

      It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price

      Maybe so. But then, when the UCS-2 scheme was adopted, the full Unicode standard was nothing more than a twinkle in the eyes of its beholders. And those that adopted it had 10 or so years of workable solution before Unicode got its act together. (In so far as it has :)

      I remember the Byte magazine article entitled something like "Universal Character Set versus Unicode", circa 1993? From memory, it came down pretty clearly on the side of UCS at that time. UCS may not be the best solution now, but for the best part of 15 years it offered a solution that Unicode has only got around to matching in the last couple or three years.

      I can't address the specifics though. I rarely ever encounter anything beyond ASCII data, so I've not bothered with it.



      It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price

      UCS-2 isn't variable width, so I think it was an error to mention it.

      UTF‑16 has all the disadvantages of UTF‑8 but none of its advantages.

      What advantage does UTF-8 have over UTF-16?

      I can only think of one advantage UTF-16 has that UTF-8 doesn't: it's not mistakable for iso-8859-*.

        I can only think of one advantage UTF-16 has that UTF-8 doesn't: it's not mistakable for iso-8859-*.

        IIRC, UTF-8 with a BOM is unmistakable for iso-8859-*. :)

        I mentioned UCS-2 because it, like UTF-16, is a mess. Except that it’s worse. Anyway, it wasn’t a mistake to mention them together, and I never even vaguely implied that both were or were not fixed width. I said people who jumped on either of those have paid for doing so, because both are problematic. Plus of course UCS-2 doesn't encode most of Unicode.

        As for the many, many advantages of UTF-8 over UTF-16, I can hardly begin to list them all. Sorts right numerically. Not dependent on endianness. Compatible with ASCII. Smaller. Works with the C syscall interface without retooling. Never uses NUL bytes unless it means them to be NUL bytes. Never lets lazy programmers trick themselves into thinking it is fixed width. Is that enough for you?
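The "sorts right numerically" point can be demonstrated; a minimal sketch, assuming the core Encode module. A bytewise comparison of UTF-8 output agrees with code point order, while UTF-16's surrogate range (0xD800-0xDFFF) sits below 0xE000-0xFFFF, so a plain bytewise sort misorders supplementary characters:

```perl
use strict;
use warnings;
use Encode qw( encode );

# U+FFFD is in the BMP; U+10000 is the first supplementary code point.
my ( $lo, $hi ) = ( "\x{FFFD}", "\x{10000}" );

# encode() returns plain octet strings, so lt/gt compare bytewise.
my $utf8_ok  = ( encode( 'UTF-8',    $lo ) lt encode( 'UTF-8',    $hi ) );

# U+10000 becomes the surrogate pair D800 DC00, which sorts below FFFD.
my $utf16_ok = ( encode( 'UTF-16BE', $lo ) lt encode( 'UTF-16BE', $hi ) );

print "UTF-8 preserves code point order: ",  $utf8_ok  ? "yes" : "no", "\n";
print "UTF-16 preserves code point order: ", $utf16_ok ? "yes" : "no", "\n";
```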

        The Wikipedia page has a lot of them.