Re^3: Encoding is a pain.

That's a start on what you need. As usual, the Devil is in the Details. Even if you solve the problem of encoding all human written communication into a bit stream, you still need to process that data. A few things you need to cover are:

How do you do basic text transformations, like lc and uc? (Does the language in question even have a concept of upper/lowercase chars?) Should 'í' sort before or after 'i'? (These get harder to implement when you consider combining chars--which is why the Unicode-enabled perl 5.8 runs so much slower than earlier versions.)
How do you store the glyphs? Assuming a 32-bit char set and each glyph taking up a 16x16-bit matrix (black-and-white, no anti-aliasing or other fancy stuff), each glyph needs 16 * 16 / 8 = 32 bytes, so you would need 2**32 * 32 bytes = 2**37 bytes = 128 GiB (about 137 GB) to store all possible glyphs. You need a way to break this up so that systems only have to deal with the subset of the data that the typical user cares about. (And I'm not sure that 16x16 bits would be big enough for many characters.)
What do you do about old code that does things like tr/A-Z/a-z/;? ("Ignore it" is often a good answer.)
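To make the glyph-storage estimate concrete, here is a rough sketch of the arithmetic (in Python rather than Perl, purely for illustration). The 32-bit code space and the 16x16 one-bit-per-pixel bitmap are the assumptions stated above; the 30,000-glyph subset is a made-up figure to show how much smaller a per-user subset would be.

```python
# Back-of-the-envelope glyph storage, using the figures assumed in the post.
CODE_POINTS = 2 ** 32          # every value of a 32-bit character set
GLYPH_BITS = 16 * 16           # 16x16 monochrome bitmap, one bit per pixel

bytes_per_glyph = GLYPH_BITS // 8
total_bytes = CODE_POINTS * bytes_per_glyph


def subset_bytes(glyph_count):
    """Storage for a font covering only `glyph_count` characters."""
    return glyph_count * bytes_per_glyph


print("bytes per glyph:", bytes_per_glyph)
print("full code space, GiB:", total_bytes // 2 ** 30)
# 30,000 is a hypothetical subset size, not a figure from any real font.
print("30,000-glyph subset, bytes:", subset_bytes(30_000))
```

The gap between the two totals is the argument for splitting the glyph data into subsets: almost nobody needs the whole space at once.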

Also, IIRC, even with combining chars, 64 bits still isn't enough to cover all human language.
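Combining chars are also where comparison and casing stop being simple table lookups. A small sketch, in Python rather than Perl and using only the standard `unicodedata` module: the same visible 'í' can be one code point or two, and the two spellings only compare equal after normalization. The helper name `same_text` is mine, not from any library.

```python
import unicodedata

precomposed = "\u00ed"   # 'í' as a single code point
combined = "i\u0301"     # 'i' followed by COMBINING ACUTE ACCENT


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(precomposed == combined)            # raw code-point comparison
print(same_text(precomposed, combined))   # comparison after normalization
print(len(precomposed), len(combined))    # one "character", two lengths

# Case mapping is not one-to-one either, and some scripts have no case at all.
print("ß".upper(), "漢".upper())
```

Every string operation that wants to treat the two spellings as the same text has to pay for that normalization step somewhere.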

I hypothesize that computers would never have become so ubiquitous in people's lives if they had first been developed in a region with a complex alphabet. The problems of developing the user interface would have been so big that I bet most people would have given up and said "well, let's just use them to crunch numbers". We might be fortunate that computers were primarily developed in a region with a relatively simple alphabet.

"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.


Replies are listed 'Best First'.

Re^4: Encoding is a pain.
by dragonchild (Archbishop) on Sep 20, 2004 at 15:34 UTC

How do you do basic text transformations, like lc and uc? (Does the language in question even have a concept of upper/lowercase chars?) Should 'í' sort before or after 'i'? (These get harder to implement when you consider combining chars--which is why the Unicode-enabled perl 5.8 runs so much slower than earlier versions.)
I think I wasn't as explicit as I should have been.
- Sorting and casing are functions of the collation set, not the character set.
- Let the users of a given collation set determine how stuff should sort, case, and all that language-specific crap. Chinese speakers couldn't care less about A vs a vs ae vs a-acute vs whatever. Just as English speakers couldn't care less about diacritics. Don't force them to care.
- Speed as an issue shouldn't get in the way of good design. Speed is a problem for engineers, not designers. And, as Moore's law keeps moving, it's not really a problem anymore.
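A minimal sketch of what "sorting and casing belong to the collation set" could look like, in Python rather than Perl for illustration. The `COLLATIONS` table, its three entries, and the helper names are all hypothetical; a real system would need far richer rules than this.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after decomposing, e.g. 'í' -> 'i'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical collation sets: each supplies its own sort key and lowercase rule,
# all layered over the same underlying character set.
COLLATIONS = {
    "codepoint": {"key": lambda s: s, "lower": str.lower},
    "accent_insensitive": {"key": lambda s: (strip_accents(s), s), "lower": str.lower},
    "caseless": {"key": lambda s: s, "lower": lambda s: s},  # script with no case
}


def sort_with(collation, words):
    """Sort words using the key function of the named collation set."""
    return sorted(words, key=COLLATIONS[collation]["key"])


words = ["iz", "ía", "ib"]
print(sort_with("codepoint", words))           # 'í' sorts after every plain 'i' word
print(sort_with("accent_insensitive", words))  # 'í' sorts alongside 'i'
```

The character data is identical in both calls; only the collation set chosen by the user changes the answer to "does 'í' sort before or after 'i'?".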
How do you store the glyphs? . . .
That's a font issue, not an encoding issue. A font would map to a given subset or collation set. Or, if you want, you could map your font to the master character set. It would be up to the application to figure out what to do with characters that don't appear in the font. There are English fonts that don't have glyphs for all the characters in English. (Symbol is a good example.)
And this isn't a radical departure. Fonts are mapped to character sets right now, but the work is done by the display library code. I'm proposing that the font would contain the necessary metadata to describe either the subset/collation set or a subset definition that the font represents. Again, this is the backwards-compatible thing going on. Fonts, historically, have been the province of the application. Later, fontmaps were created, but the application still maintained control over how to interpret the fontmap. Instead, why doesn't the application ask the fontmap if it knows how to represent character #234211 and, if so, would it please return the appropriate glyph?
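Roughly what I have in mind, sketched in Python rather than Perl for illustration. The `Font` class, its `covers` and `glyph_for` methods, and the `render` helper are all hypothetical names, not any real font API; the glyph data is just placeholder strings.

```python
class Font:
    """Hypothetical font that carries its own coverage metadata."""

    def __init__(self, name, glyphs):
        self.name = name
        self._glyphs = dict(glyphs)   # code point -> glyph data

    def covers(self, code_point):
        """Answer the question 'do you know how to represent this character?'"""
        return code_point in self._glyphs

    def glyph_for(self, code_point):
        """Return the glyph, or None if this font does not cover the character."""
        return self._glyphs.get(code_point)


def render(text, fonts, missing="?"):
    """Ask each font in turn; the application decides what to do on a miss."""
    out = []
    for ch in text:
        cp = ord(ch)
        font = next((f for f in fonts if f.covers(cp)), None)
        out.append(font.glyph_for(cp) if font else missing)
    return out


latin = Font("latin", {ord("a"): "<a>", ord("b"): "<b>"})
greek = Font("greek", {ord("α"): "<alpha>"})
print(render("aαz", [latin, greek]))
```

The font only answers for the subset it declares; the fallback for 'z' here is the application's choice, which is the division of labour described above.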
What do you do about old code that does things like tr/A-Z/a-z/;? ("Ignore it" is often a good answer.)
The better answer is to have a collation set that represents 7-bit ASCII. There is no need to have collation sets be the same size as each other. A Chinese character set would probably run to at least 30k characters. An English character set could run as small as 40 or 50, if you ignore case. Some languages might be able to get away with even fewer.
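As a sketch of the 7-bit ASCII idea (Python rather than Perl, for illustration): old code that assumes A-Z maps to a-z keeps working if it runs against a collation set that only knows about those 26 pairs. The name `ascii_lc` is made up; `str.maketrans` and `str.translate` are standard Python.

```python
import string

# Hypothetical "7-bit ASCII" collation set: only A-Z has a lowercase form.
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_lc(text):
    """Lowercase A-Z only; every other character passes through unchanged."""
    return text.translate(ASCII_LOWER)


sample = "ÀBC Déjà VU"
print(ascii_lc(sample))   # ASCII view: 'À' is left alone
print(sample.lower())     # full Unicode view: 'À' is lowercased too
```

Both views sit on the same stored text; the old code simply gets the small view it was written for.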
If your system is sufficiently general, with appropriate views into it, then there is nothing you cannot emulate. Heck, if one were truly masochistic, one might even develop a collation set that would emulate Unicode. :-)

------
We are the carpenters and bricklayers of the Information Age.

Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

I shouldn't have to say this, but any code, unless otherwise stated, is untested
