Re: Best Way to Get Length of UTF-8 String in Bytes?

Replies are listed 'Best First'.
Re^2: Best Way to Get Length of UTF-8 String in Bytes? by Jim (Curate) on Apr 24, 2011 at 01:08 UTC
The problem is that it depends on one important thing: Whatever in the world do you want it for, anyway? To fit UTF-8 text into a column in a database management system that does not quantify the size of text in characters, but in bytes. The database management system is not relational, not conventional, and probably not one you've ever heard of. It’s quite possible that there might be a better approach you just don’t know about. Very possible. So if I have a VARCHAR column limit of 32,767 bytes, not characters, how do I trim a UTF-8 string to ensure I don't wrongly try to put more than 32,767 bytes worth of it into a column? Thank you for your reply. I appreciate it. Jim	[reply]
Re^3: Best Way to Get Length of UTF-8 String in Bytes? by John M. Dlugosz (Monsignor) on Apr 24, 2011 at 11:42 UTC
It is an intentional properly of UTF-8 encoding that although variable length, you can easily figure out that you're in the middle of a character and where whole characters begin. Continuation bytes always start with the bits 10xxxxxx. Single-byte characters always have a high bit of 0 (0xxxxxxx), and multi-byte characters always start with a byte that has as many leading 1 bits as there are bytes total: 110xxxxx for two bytes, 1110xxxx for three bytes, etc. So, start at position N of the utf-8 encoded byte string that is the maximum length. While the byte at position N is a continuation byte, decrement N. Now you can truncate to length N. To prevent clipping the accents off a base character or something like that, you can furthermore look at the whole character beginning at N. Check the Unicode Properties to see if it's a modifier or something. If it is, decrement N again repeat.	[reply]
Re^3: Best Way to Get Length of UTF-8 String in Bytes? by tchrist (Pilgrim) on Apr 24, 2011 at 04:06 UTC
To fit UTF-8 text into a column in a database management system that does not quantify the size of text in characters, but in bytes. The database management system is not relational, not conventional, and probably not one you've ever heard of. I feel lucky that all my database experiences in recent memory have involved ones that had no fixed limits on any of their sizes. One still had to encode/decode to UTF-8, but I didn’t have your particular problem. It’s quite possible that there might be a better approach you just don’t know about. Very possible. So if I have a VARCHAR column limit of 32,767 bytes, not characters, how do I trim a UTF-8 string to ensure I don't wrongly try to put more than 32,767 bytes worth of it into a column? Well, Jim, that’s quite a pickle. I think I’m going to renege about my advice in perlfunc. If you encode to UTF‑8 bytes, then you won’t know whether and most especially, where to truncate your string, because you’ve lost the character information. And you really have to have character information, plus more, too. It is vaguely possible that you might be able to arrange something with the `\C` regex escape for an octet, somehow combining a bytewise assertion of `\A\C{0,32767}` with one that fits a charwise `\A.` or better yet a grapheme-wise `^\X` within that. But that isn’t the approach I finally ended up using, because that sounded too complicated and messy. I decided to do something really simple. My “Pac‑Man^®” algorithm is simple: chop until short enough. More specifically, remove the last grapheme until the string has fewer than your maximum bytes. You can do a bit better than blind naïveté by realizing that even at maximum efficiency of one byte per character (pure ASCII), if the actual char length is more than the maximum allowed byte length, you can pre-truncate the character count. That way you don’t come slowly pac-manning back from 100k strings. There are a few things complicating your life. Just as you do not wish to chop off a byte in the middle of a character, neither do you want to chop of a character in the middle of a grapheme. You don’t want "\cM\cJ" to get split if that’s in your data, and you very most especially do not wish to lose a Grapheme_Extend code point like a diacritic or an underline/overline off of its Grapheme_Base code point. What I ended up doing, therefore, was this: `require bytes; if (length($s) > $MAX) { substr($s, $MAX) = ""; } $s =~ s/\X\z// until $MAX > bytes::length($s);` [download] That assumes that the strings are Unicode strings with their UTF‑8 flags on. I have done the wickedness of using the `bytes` namespace. This breaks the encapsulation of abstract characters. I am relying on knowing that the internal byte length is in UTF-8. If that changes — and there is no guarantee at all that it will not do so someday — then this code will break. Also, it is critical that it be `require`d, not `use`d. You do not want byte semantics for your operations; you just want to be able to get a `bytes::length` on its own. I haven’t benchmarked this against doing it the “pure” way with a bunch of calls to `encode_utf8`. You might want to do that. But what I have done is run the algorithm against a bunch of strings: some in NFD form, some in NFC; some French and Chinese; some with multiple combining characters; some even with very fat graphemes from up in the SMP with their marks (math letters with overlines). I ran it with MAX == 25 bytes, but I see no reason why it shouldn’t work set to your own 32,767. Here are some examples of the traces: String was <NFD: tête‐à‐tête tête‐à‐tête> start string has graphlen 28, charlen 34, bytelen 48 CHARLEN 34 > 25, truncating to 25 CHARS bytelen 33 still too long, chopping last grapheme deleted grapheme <e> U+0065, charlen -1, bytelen -1 bytelen 32 still too long, chopping last grapheme deleted grapheme <t> U+0074, charlen -1, bytelen -1 bytelen 31 still too long, chopping last grapheme deleted grapheme <ê> U+0065.0302, charlen -2, bytelen -3 bytelen 28 still too long, chopping last grapheme deleted grapheme <t> U+0074, charlen -1, bytelen -1 bytelen 27 still too long, chopping last grapheme deleted grapheme < > U+0020, charlen -1, bytelen -1 bytelen 26 still too long, chopping last grapheme deleted grapheme <e> U+0065, charlen -1, bytelen -1 final string has graphlen 15, charlen 18, bytelen 25 Trunc'd is <NFD: tête‐à‐têt> String was <NFD 蓝 lán and 绿 lǜ> start string has graphlen 18, charlen 21, bytelen 28 bytelen 28 still too long, chopping last grapheme deleted grapheme <ǜ> U+0075.0308.0300, charlen -3, bytelen -5 final string has graphlen 17, charlen 18, bytelen 23 Trunc'd is <NFD 蓝 lán and 绿 l> String was <Chinese: 青天,白日,满地红> start string has graphlen 18, charlen 18, bytelen 36 bytelen 36 still too long, chopping last grapheme deleted grapheme <红> U+7EA2, charlen -1, bytelen -3 bytelen 33 still too long, chopping last grapheme deleted grapheme <地> U+5730, charlen -1, bytelen -3 bytelen 30 still too long, chopping last grapheme deleted grapheme <满> U+6EE1, charlen -1, bytelen -3 bytelen 27 still too long, chopping last grapheme deleted grapheme <,> U+FF0C, charlen -1, bytelen -3 final string has graphlen 14, charlen 14, bytelen 24 Trunc'd is <Chinese: 青天,白日> String was <NFD: hã̂ç̌k hã̂ç̌k hẫç̌k hẫç̌k> start string has graphlen 24, charlen 35, bytelen 51 CHARLEN 35 > 25, truncating to 25 CHARS bytelen 37 still too long, chopping last grapheme deleted grapheme <ç̌> U+00E7.030C, charlen -2, bytelen -4 bytelen 33 still too long, chopping last grapheme deleted grapheme <ẫ> U+1EAB, charlen -1, bytelen -3 bytelen 30 still too long, chopping last grapheme deleted grapheme <h> U+0068, charlen -1, bytelen -1 bytelen 29 still too long, chopping last grapheme deleted grapheme < > U+0020, charlen -1, bytelen -1 bytelen 28 still too long, chopping last grapheme deleted grapheme <k> U+006B, charlen -1, bytelen -1 bytelen 27 still too long, chopping last grapheme deleted grapheme <ç̌> U+0063.0327.030C, charlen -3, bytelen -5 final string has graphlen 12, charlen 16, bytelen 22 Trunc'd is <NFD: hã̂ç̌k hã̂> String was <𝐂̅ = sqrt[𝐀̅² + 𝐁̅²]> start string has graphlen 17, charlen 20, bytelen 34 bytelen 34 still too long, chopping last grapheme deleted grapheme <]> U+005D, charlen -1, bytelen -1 bytelen 33 still too long, chopping last grapheme deleted grapheme <²> U+00B2, charlen -1, bytelen -2 bytelen 31 still too long, chopping last grapheme deleted grapheme <𝐁̅> U+1D401.0305, charlen -2, bytelen -6 final string has graphlen 14, charlen 16, bytelen 25 Trunc'd is <𝐂̅ = sqrt[𝐀̅² + > Here’s the complete program. I’ve uniquoted the strings, so the program itself is actually in pure ASCII, which means I can put in in `<code>` tags here instead of messing around with icky `<pre>` and weird escapes. You can download it easily enough if you want, play with the numbers and all, but the heart of algorithm is just the one liner to throw out the last grapheme and check the byte length. You’ll see that I’ve left the debugging in. #!/usr/bin/env perl use 5.12.0; use strict; use autodie; use warnings; use utf8; use open qw<:std :utf8>; use charnames qw< :full >; require bytes; my $MAX_BYTES = 25; my ($MIN_BPC, $MAX_BPC) = (1, 4); my $MAX_CHARS = $MAX_BYTES / $MIN_BPC; sub bytelen(_) { require bytes; return bytes::length($_[0]); } sub graphlen(_) { my $count = 0; $count++ while $_[0] =~ /\X/g; return $count; } sub charlen(_) { return length($_[0]); } sub shorten(_) { my $s = $_[0]; printf "\tstart string has graphlen %d, charlen %d, bytelen %d\n", graphlen($s), charlen($s), bytelen($s); if (charlen($s) > $MAX_CHARS) { printf "\tCHARLEN %d > %d, truncating to %d CHARS\n", length($s), $MAX_BYTES, $MAX_CHARS; substr($s, $MAX_CHARS) = ""; } while (bytelen($s) > $MAX_BYTES) { printf "\tbytelen %d still too long, chopping last grapheme\n" +, bytes::length($s); $s =~ s/(\X)\z//; printf "\tdeleted grapheme <%s> U+%v04X, charlen -%d, bytelen +-%d\n", $1, $1, length($1), bytes::length($1); } printf "\tfinal string has graphlen %d, charlen %d, bytelen %d\n", graphlen($s), charlen($s), bytelen($s); return $s; } my @strings = ( "this lines starts a bit too long", "NFC: cr\N{LATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LET +TER U WITH CIRCUMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e et cr\N{L +ATIN SMALL LETTER E WITH GRAVE}me br\N{LATIN SMALL LETTER U WITH CIRC +UMFLEX}l\N{LATIN SMALL LETTER E WITH ACUTE}e", "NFC: t\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}te\N{HYPHEN}\N{LATI +N SMALL LETTER A WITH GRAVE}\N{HYPHEN}t\N{LATIN SMALL LETTER E WITH C +IRCUMFLEX}te t\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}te\N{HYPHEN}\N{ +LATIN SMALL LETTER A WITH GRAVE}\N{HYPHEN}t\N{LATIN SMALL LETTER E WI +TH CIRCUMFLEX}te", "NFD: cre\N{COMBINING GRAVE ACCENT}me bru\N{COMBINING CIRCUMFLEX A +CCENT}le\N{COMBINING ACUTE ACCENT}e et cre\N{COMBINING GRAVE ACCENT}m +e bru\N{COMBINING CIRCUMFLEX ACCENT}le\N{COMBINING ACUTE ACCENT}e", "NFD: te\N{COMBINING CIRCUMFLEX ACCENT}te\N{HYPHEN}a\N{COMBINING G +RAVE ACCENT}\N{HYPHEN}te\N{COMBINING CIRCUMFLEX ACCENT}te te\N{COMBIN +ING CIRCUMFLEX ACCENT}te\N{HYPHEN}a\N{COMBINING GRAVE ACCENT}\N{HYPHE +N}te\N{COMBINING CIRCUMFLEX ACCENT}te", "NFC \N{U+84DD} l\N{LATIN SMALL LETTER A WITH ACUTE}n and \N{U+7EF +F} l\N{LATIN SMALL LETTER U WITH DIAERESIS AND GRAVE}", "NFD \N{U+84DD} la\N{COMBINING ACUTE ACCENT}n and \N{U+7EFF} lu\N{ +COMBINING DIAERESIS}\N{COMBINING GRAVE ACCENT}", "XXX NFC q\N{LATIN SMALL LETTER I WITH MACRON}ng ti\N{LATIN SMALL +LETTER A WITH MACRON}n, b\N{LATIN SMALL LETTER A WITH ACUTE}i r\N{LAT +IN SMALL LETTER I WITH GRAVE}, m\N{LATIN SMALL LETTER A WITH CARON}n +d\N{LATIN SMALL LETTER I WITH GRAVE} h\N{LATIN SMALL LETTER O WITH AC +UTE}ng", "XXX NFD qi\N{COMBINING MACRON}ng tia\N{COMBINING MACRON}n, ba\N{C +OMBINING ACUTE ACCENT}i ri\N{COMBINING GRAVE ACCENT}, ma\N{COMBINING +CARON}n di\N{COMBINING GRAVE ACCENT} ho\N{COMBINING ACUTE ACCENT}ng", "Chinese: \N{U+9752}\N{U+5929}\N{FULLWIDTH COMMA}\N{U+767D}\N{U+65 +E5}\N{FULLWIDTH COMMA}\N{U+6EE1}\N{U+5730}\N{U+7EA2}", "normal \N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN SMALL + LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN SMALL +LETTER E} normal \N{FULLWIDTH LATIN SMALL LETTER W}\N{FULLWIDTH LATIN + SMALL LETTER I}\N{FULLWIDTH LATIN SMALL LETTER D}\N{FULLWIDTH LATIN +SMALL LETTER E}", "NFC: h\N{LATIN SMALL LETTER A WITH TILDE}\N{COMBINING CIRCUMFLEX +ACCENT}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k ha\N +{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA +}\N{COMBINING CARON}k h\N{LATIN SMALL LETTER A WITH CIRCUMFLEX AND TI +LDE}\N{LATIN SMALL LETTER C WITH CEDILLA}\N{COMBINING CARON}k ha\N{CO +MBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N +{COMBINING CARON}k", "NFD: ha\N{COMBINING TILDE}\N{COMBINING CIRCUMFLEX ACCENT}c\N{COMB +INING CEDILLA}\N{COMBINING CARON}k ha\N{COMBINING TILDE}\N{COMBINING +CIRCUMFLEX ACCENT}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k ha\N{COM +BINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE}c\N{COMBINING CEDILLA}\N{ +COMBINING CARON}k ha\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING TILDE +}c\N{COMBINING CEDILLA}\N{COMBINING CARON}k", "\N{MATHEMATICAL BOLD CAPITAL C}\N{COMBINING OVERLINE} = sqrt[\N{M +ATHEMATICAL BOLD CAPITAL A}\N{COMBINING OVERLINE}\N{SUPERSCRIPT TWO} ++ \N{MATHEMATICAL BOLD CAPITAL B}\N{COMBINING OVERLINE}\N{SUPERSCRIPT + TWO}]", "4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\ +N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE} 4\N{FRACTION SLASH}3\N{INVIS +IBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIP +T THREE} 4\N{FRACTION SLASH}3\N{INVISIBLE TIMES}\N{GREEK SMALL LETTER + PI}\N{INVISIBLE TIMES}r\N{SUPERSCRIPT THREE} 4\N{FRACTION SLASH}3\N{ +INVISIBLE TIMES}\N{GREEK SMALL LETTER PI}\N{INVISIBLE TIMES}r\N{SUPER +SCRIPT THREE}", ); printf "MAX byte length is %d\n\n", $MAX_BYTES; for my $line (@strings) { chomp $line; say "String was <$line>"; my $trunk = shorten($line); say "Trunc'd is <$trunk>\n"; } exit 0; [download] There are other ways to go about this, but this seemed to work well enough. Hope it helps. Oh, BTW, if you really want to do print-columns instead of graphemes, look to the Unicode::GCString module; it comes with Unicode::LineBreak. Both are highly recommended. I use them in my unifmt program to do intelligent linebreaking of Asian text per UAX#14.	[reply] [d/l] [select]
Re^4: Best Way to Get Length of UTF-8 String in Bytes? by Jim (Curate) on Apr 24, 2011 at 05:17 UTC
Amazing! Thank you, thank you, thank you, Tom! It's a lot for me to assimilate with my limited intellect and meager Perl skills, but I will certainly try. My real objective—as awful as it sounds—is to split arbitrarily long UTF-8 strings into chunks of 32,767-byte substrings and distribute them into a series of VARCHAR columns. It's horrible, I know, but if I don't do it, another Ivfhny Onfvp .ARG programmer will—and much more badly than I. Jim	[reply]
Re^4: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 24, 2011 at 05:43 UTC
I see `use bytes;` without any `utf8::upgrade` or `utf8::downgrade`, and that usually indicates code that suffers from "The Unicode Bug". `sub bytelen(_) { require bytes; return bytes::length($_[0]); }` [download] should be `sub utf8len(_) { utf8::upgrade($_[0]); require bytes; return bytes::length($_[0]); }` [download] Or the same without bytes: `sub utf8len(_) { utf8::upgrade($_[0]); Encode::_utf8_off($_[0]); my $utf8len = length($_[0]); Encode::_utf8_on($_[0]); return $utf8len; }` [download] Update: Added non-bytes alternative.	[reply] [d/l] [select]
Re^5: Best Way to Get Length of UTF-8 String in Bytes? by tchrist (Pilgrim) on Apr 24, 2011 at 06:01 UTC
Re^6: Best Way to Get Length of UTF-8 String in Bytes? by Anonymous Monk on Apr 24, 2011 at 06:04 UTC
Re^6: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 24, 2011 at 06:06 UTC
Re^4: Best Way to Get Length of UTF-8 String in Bytes? by Jim (Curate) on Apr 26, 2011 at 21:04 UTC
I've studied the demonstration script and I understand everything it's doing, except for this bit: `my $MAX_BYTES = 25; my ($MIN_BPC, $MAX_BPC) = (1, 4); my $MAX_CHARS = $MAX_BYTES / $MIN_BPC;` [download] What's going on here? `$MAX_CHARS` will always be set to the value of `$MAX_BYTES`, and `$MAX_BPC` seems to serve no function. Am I right? Also, what happens if, in the initial truncation of the string done using `substr()` as an lvalue, we land smack dab in the middle of a grapheme, and the rightmost character in the resultant truncated string is, by itself, a valid grapheme? `D:\>perl -CO -Mcharnames=:full -wE "$MAX = 4; $cafe = qq/cafe\N{COMBIN +ING ACUTE ACCENT}/; say $cafe; substr($cafe, $MAX) = ''; say $cafe;" +> cafe.txt D:\>` [download] Here's the text in the output file cafe.txt: café cafe (Thanks again for this very helpful script!)	[reply] [d/l] [select]
Re^5: Best Way to Get Length of UTF-8 String in Bytes? by tchrist (Pilgrim) on Apr 27, 2011 at 05:16 UTC
Re^6: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 27, 2011 at 07:18 UTC
Some notes below your chosen depth have not been shown here
Re^2: Best Way to Get Length of UTF-8 String in Bytes? by BrowserUk (Patriarch) on Apr 24, 2011 at 12:53 UTC
Whatever in the world do you want it for, anyway? Obtaining knowledge of the storage requirements for a piece of data does not seem such an unusual requirement to me. Whether it is for sizing a buffer for interfacing to a C (other) language library; or for length-prefixing a packet for a transmission protocol; or for indexing a file; or any of a dozen other legitimate uses. Indeed, given that this information is readily & trivially available to Perl: `#! perl -slw use strict; use Devel::Peek; use charnames qw( :full ); my $text = "\N{LATIN SMALL LETTER E WITH ACUTE}"; print length $text; Dump $text; __END__ C:\test>junk.pl 1 SV = PVMG(0x27fea8) at 0x237ab8 REFCNT = 1 FLAGS = (PADMY,SMG,POK,pPOK,UTF8) IV = 0 NV = 0 PV = 0x3c78e8 "\303\251"\0 [UTF8 "\x{e9}"] CUR = 2 LEN = 8 MAGIC = 0x23c648 MG_VIRTUAL = &PL_vtbl_utf8 MG_TYPE = PERL_MAGIC_utf8(w) MG_LEN = 1` [download] the absence of a simple built-in mechanism for obtaining it seems both remiss and arbitrary. But then, this is just another in a long list of reasons why the whole Unicode thing should be dumped in favour of a standard that throws away all the accumulated crude of transitional standards and yesteryears physical and financial restrictions. Given the cheapness of today's ram, variable length encodings make no sense given the restrictions and overheads they impose. And any 'standard' that means that it is impossible to tell what a piece of data actually represents without reference to some external metadata is an equal nonsense. With luck, the current mess will be consigned to the bitbucket of history along with all the other evolutionary dead ends like 6-bit bytes and 36-bit words. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply] [d/l]
Re^3: Best Way to Get Length of UTF-8 String in Bytes? by tchrist (Pilgrim) on Apr 24, 2011 at 16:13 UTC
But then, this is just another in a long list of reasons why the whole Unicode thing should be dumped in favour of a standard that throws away all the accumulated crude of transitional standards and yesteryears physical and financial restrictions. Given the cheapness of today's ram, variable length encodings make no sense given the restrictions and overheads they impose. And any 'standard' that means that it is impossible to tell what a piece of data actually represents without reference to some external metadata is an equal nonsense. With luck, the current mess will be consigned to the bitbucket of history along with all the other evolutionary dead ends like 6-bit bytes and 36-bit words. Yes and no. The no parts are that you seem to have confused the UTF‑8 with Unicode. Unicode is here to stay, and does not share in UTF‑8’s flaws. But realistically, you are simply never going to get rid of UTF‑8 as a transfer format. Do you truly think that people are going to put up with the near-quadrupling in space that the gigabytes and gigabytes of large corpora would require if they were stored or transfered as UTF‑32? That will never happen. The yes part is that I agree that `int` is the new `char`. No character data type of fixed width should ever be smaller than the number of bits needed to store any and all possible Unicode code points. Because Unicode is a 21‑bit charset, that means you need 32‑bit characters. It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price, since UTF‑16 has all the disadvantages of UTF‑8 but none of its advantages. At least Perl didn’t make that particular brain-damaged mistake! It could have been much worse. UTF‑8 is now the de facto standard, and I am very glad that Perl didn’t do the stupid thing that Java and so many others did: just try matching non-BMP code points in character classes, for example. Can’t do it in the UTF-16 languages. Oops! :(	[reply] [d/l] [select]
Re^4: Best Way to Get Length of UTF-8 String in Bytes? by BrowserUk (Patriarch) on Apr 24, 2011 at 19:23 UTC
Do you truly think that people are going to put up with the near-quadrupling in space that the gigabytes and gigabytes of large corpora would require if they were stored or transfered as UTF‑32? I think that this 'space' argument is a complete crock. Firstly, if saving diskspace is the primary criteria--even though disk space is cheaper today than ever before, then gzipping or even bzip2ing is far, far more effective than variable-length characters. Even if you expand UTF-8 to UTF-32 prior to gzipping, the resultant filesize is hardly affected. Secondly, if saving ram is the criteria then load the data zipped and expand it in memory. Disk is slow and ram is fast. The cost of unpacking on demand is offset, almost if not entirely, by the time saved reading smaller volumes from disk. Finally, if the 'new' encoding scheme addressed the problem of having the data itself identify what it is--through (say) an expanded BOM mark mechanism or similar-- then there would be no need to convert legacy data. It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price Maybe so. But then when the UCS-2 schema was adopted, the full Unicode standard was nothing more than a twinkle in the eyes of its beholders. And those that adopted it had 10 or so years of workable solution before Unicode got its act together. (In so far as it has:) I remember the Byte magazine article entitled somthing like "Universal Character Set versus Unicode" circa. 1993? From memory, it came down pretty clearly on the side of UCS at that time. UCS may not be the best solution now, but for the best part of 15 years it offered a solution that Unicode is only getting around matching to in the last couple or 3 years . I can't address the specifics though. I rarely ever encounter anything beyond ascii data, so I've not bothered with it. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice.	[reply]
Re^4: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 25, 2011 at 05:05 UTC
It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price UCS-2 isn't variable width, so I think it was an error to mention it. UTF‑16 has all the disadvantages of UTF‑8 but none of its advantages. What advantage does UTF-8 have over UTF-16? I can only think of one UTF-16 has that UTF-8 doesn't: It's not mistakable for iso-8859-*.	[reply]
Re^5: Best Way to Get Length of UTF-8 String in Bytes? by Anonymous Monk on Apr 25, 2011 at 06:40 UTC
Re^6: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 25, 2011 at 06:59 UTC
Re^5: Best Way to Get Length of UTF-8 String in Bytes? by tchrist (Pilgrim) on Apr 27, 2011 at 05:30 UTC
Re^6: Best Way to Get Length of UTF-8 String in Bytes? by ikegami (Patriarch) on Apr 27, 2011 at 07:06 UTC

Whatever in the world do you want it for, anyway?

café cafe