
But then, this is just another in a long list of reasons why the whole Unicode thing should be dumped in favour of a standard that throws away all the accumulated crud of transitional standards and yesteryear's physical and financial restrictions.
Given the cheapness of today's RAM, variable-length encodings make no sense given the restrictions and overheads they impose. And any 'standard' that means that it is impossible to tell what a piece of data actually represents without reference to some external metadata is an equal nonsense.
With luck, the current mess will be consigned to the bitbucket of history along with all the other evolutionary dead ends like 6-bit bytes and 36-bit words.

Yes and no.

The no part is that you seem to have confused UTF‑8 with Unicode. Unicode is here to stay, and does not share in UTF‑8’s flaws. But realistically, you are simply never going to get rid of UTF‑8 as a transfer format. Do you truly think that people are going to put up with the near-quadrupling in space that the gigabytes and gigabytes of large, mostly ASCII corpora would require if they were stored or transferred as UTF‑32? That will never happen.
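To put a rough number on that quadrupling, here is a minimal sketch in Python (not Perl, purely for illustration) that measures the same string in the three encoding forms. The helper name `encoded_sizes` and the sample text are my own assumptions, and the exact ratio depends on how much of a corpus is ASCII.

```python
def encoded_sizes(text):
    """Return the byte length of `text` in three Unicode encoding forms."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The "-le" variants are used so that no byte-order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

sample = "The quick brown fox"  # 19 ASCII characters (hypothetical sample)
sizes = encoded_sizes(sample)
print(sizes)  # {'utf-8': 19, 'utf-16': 38, 'utf-32': 76}
```

For pure ASCII the UTF‑32 form is exactly four times the UTF‑8 form; text heavy in non-Latin scripts narrows the gap, which is why "near-quadrupling" applies to mostly ASCII data.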

The yes part is that I agree that int is the new char. No character data type of fixed width should ever be smaller than the number of bits needed to store any and all possible Unicode code points. Because Unicode is a 21‑bit charset (code points run from U+0000 to U+10FFFF), that means you need 32‑bit characters.
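The arithmetic behind "21 bits, therefore 32" can be checked in a few lines. This is only a sketch in Python; `smallest_standard_width` is a hypothetical helper, and the list of candidate widths is my assumption about which integer sizes are commonly available.

```python
import sys

# Highest Unicode code point; sys.maxunicode is 0x10FFFF on Python 3.3+.
MAX_CODE_POINT = sys.maxunicode
bits_needed = MAX_CODE_POINT.bit_length()

def smallest_standard_width(bits):
    """Return the smallest common integer width that holds `bits` bits."""
    for width in (8, 16, 32, 64):
        if bits <= width:
            return width
    raise ValueError("no standard width is large enough")

print(hex(MAX_CODE_POINT), bits_needed, smallest_standard_width(bits_needed))
# 0x10ffff 21 32
```

A 16‑bit type falls five bits short, so the next common fixed width that holds every code point is 32 bits.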

It also means that everyone who jumped on the broken UCS‑2 or UTF‑16 bandwagon is paying a really wicked price, since UTF‑16 has all the disadvantages of UTF‑8 but none of its advantages.
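To see why UTF‑16 is still a variable-width encoding, here is a minimal Python sketch of the standard surrogate-pair calculation for a code point above U+FFFF. The function name `to_surrogate_pair` and the choice of U+1F600 as the example are mine; the result is cross-checked against Python's own UTF‑16 encoder.

```python
import struct

def to_surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # 20 significant bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against the built-in encoder: two 16-bit code units.
units = struct.unpack("<2H", chr(0x1F600).encode("utf-16-le"))
print(units == (high, low))  # True
```

Any code that indexes or counts 16‑bit units therefore has to handle one-unit and two-unit characters, just as UTF‑8 code has to handle one to four bytes.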

At least Perl didn’t make that particular brain-damaged mistake! It could have been much worse. UTF‑8 is now the de facto standard, and I am very glad that Perl didn’t do the stupid thing that Java and so many others did: just try matching non-BMP code points in character classes, for example. Can’t do it in the UTF-16 languages. Oops! :(
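As an illustration of what code-point-oriented strings buy you, here is a small sketch in Python, whose `str` type (like Perl's character strings) is a sequence of code points rather than 16‑bit units. The particular range, U+1D49C through U+1D4B5 (mathematical script capitals), and the sample text are just assumptions chosen for the demonstration.

```python
import re

script_a = "\U0001D49C"  # MATHEMATICAL SCRIPT CAPITAL A, outside the BMP

# One character as far as the string is concerned...
print(len(script_a))  # 1
# ...but two 16-bit code units once encoded as UTF-16.
print(len(script_a.encode("utf-16-le")) // 2)  # 2

# A character class whose endpoints are both non-BMP code points.
pattern = re.compile("[\U0001D49C-\U0001D4B5]")
text = "x \U0001D49E y"
matches = pattern.findall(text)
print(matches == ["\U0001D49E"])  # True
```

Because the range endpoints are single code points here, the class behaves the same as `[a-z]` would. An engine that sees only UTF‑16 code units would instead read each endpoint as two separate surrogates.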

In reply to Re^3: Best Way to Get Length of UTF-8 String in Bytes? by tchrist
in thread Best Way to Get Length of UTF-8 String in Bytes? by Jim
