in reply to Encoding is a pain.

Dan Sugalski has some excellent, practical advice on encoding strings: http://www.sidhe.org/~dan/blog/archives/000256.html.

And no, Unicode is not the ultimate answer. In Dan's own words, it's a partial solution to a problem you may not even have. It's certainly useful and would make things easier if everyone used it, but the problem of encoding all human written communication is such a big one that I doubt there will ever be a full solution.

"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

Replies are listed 'Best First'.
Re^2: Encoding is a pain.
by dragonchild (Archbishop) on Sep 20, 2004 at 14:49 UTC
    . . . the problem of encoding all human written communication is such a big one that I doubt there will ever be a full solution.

    For shame, hardburn! Doubting the human resolve like that. :-)

    Seriously ... I think that we as programmers have been ill-served by the "goal" of backwards-compatibility, especially as it pertains to Unicode. There is a good solution to encoding all human written communication:

    1. Gather all possible characters in one place.
    2. Give each one a number from 1 to N, where N is the number of possible characters.
    3. Use M bytes per character to encode this list, where M = ceil(log2(N) / 8).
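
    A quick, untested sketch of that arithmetic in Perl (the N values are purely illustrative, not real character counts):

        use POSIX qw(ceil);

        # Fixed-width bytes per character for a repertoire of N characters:
        # M = ceil(log2(N) / 8)
        sub bytes_needed {
            my ($n) = @_;
            return ceil( log($n) / log(2) / 8 );
        }

        printf "%13d characters => %d byte(s) each\n", $_, bytes_needed($_)
            for 128, 65_536, 1_000_000, 1_000_000_000;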

    Ignore the ideas of:

    • keeping specific languages together
    • keeping different languages apart
    • ordering this list for ease of sorting specific languages
    • phonetic vs. written vs. any crap.

    Every character is listed, even if it's a billion characters. If you create a new character, add it at the end and update the appropriate language-specific subsets and collation sets.

    If, like in some Asian languages, you can take two characters and combine them, have a combination character. We do the same thing in English with the correct spelling of "aether". If you need to, have a combine-2, combine-3, etc.
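
    A minimal, untested sketch of how a combine-2 marker might be decoded; the marker number is pure invention:

        # Hypothetical marker: "treat the next two characters as one".
        my $COMBINE_2 = 900_000_001;

        sub decode {
            my @cps = @_;
            my @chars;
            while (@cps) {
                my $cp = shift @cps;
                if ($cp == $COMBINE_2) {
                    push @chars, '(' . join('+', splice(@cps, 0, 2)) . ')';
                }
                else {
                    push @chars, $cp;
                }
            }
            return join ' ', @chars;
        }

        # "ae" as one combined character, followed by a plain character:
        print decode($COMBINE_2, 97, 101, 116), "\n";   # (97+101) 116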

    Then, you can have language-specific subsets (like ASCII, Latin-X, *-JIS, etc.) that refer to that master list. So, ASCII might still be the 128 characters we know and love, but they refer to 234, 12312, 5832, etc.

    Sorting would be handled by the fact that your collation set DWIMs. And, each language subset can have a default collation set, just like Oracle does it.
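
    Both of those pieces are just lookup tables against the master list. An untested sketch, reusing the master numbers from above (everything else is made up):

        # 'ASCII' as a subset of the master list.
        my %ascii = (
            'A' => 234,
            'B' => 12_312,
            'C' => 5_832,
            # ... and so on for the rest of the repertoire ...
        );

        # Encoding is just a lookup into the subset:
        my @master = map { $ascii{$_} } split //, 'CAB';
        print "@master\n";   # 5832 234 12312

        # A default collation set for the subset would be one more table,
        # mapping those same master numbers to sort weights.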

    I fail to see the problem ...

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      That's a start on what you need. As usual, the Devil is in the Details. Even if you solve the problem of encoding all human written communication into a bit stream, you still need to process that data. A few things you need to cover are:

      • How do you do basic text transformations, like lc and uc? (Does the language in question even have a concept of upper/lowercase chars?) Should 'í' sort before or after 'i'? (These get harder to implement when you consider combining chars--which is why the Unicode-enabled perl 5.8 runs so much slower than earlier versions).
      • How do you store the glyphs? Assuming a 32-bit char set and each glyph taking up a 16x16 bit matrix (black-and-white, no anti-aliasing or other fancy stuff), you would need 2**32 * 32 bytes = 2**37 bytes, roughly 128 GB, to store all possible glyphs. You need a way to break this up so that systems only have to deal with the subset of the data that the typical user cares about. (And I'm not sure that 16x16 bits would be big enough for many characters.) The arithmetic is sketched after this list.
      • What do you do about old code that does things like tr/A-Z/a-z/;? ("Ignore it" is often a good answer.)
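
      Here's the arithmetic behind that storage figure (untested, and the glyph size is the same rough assumption as above):

          # 2**32 possible characters, each a 16x16 1-bit bitmap.
          my $glyphs          = 2 ** 32;
          my $bytes_per_glyph = 16 * 16 / 8;                  # 32 bytes
          my $total_bytes     = $glyphs * $bytes_per_glyph;   # 2**37 bytes
          printf "%.0f GB\n", $total_bytes / 1024 ** 3;       # ~128 GB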

      Also, IIRC, even with combining chars, 64 bits still isn't enough to cover all human language.

      I hypothesize that computers would never have become so ubiquitous in people's lives had they first been developed in a region with a complex alphabet. The problems of developing the user interface would have been so big that I bet most people would have given up and said "well, let's just use them to crunch numbers". We might be fortunate that computers were primarily developed in a region with a relatively simple alphabet.

      "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

        • How do you do basic text transformations, like lc and uc? (Does the language in question even have a concept of upper/lowercase chars?) Should 'í' sort before or after 'i'? (These get harder to implement when you consider combining chars--which is why the Unicode-enabled perl 5.8 runs so much slower than earlier versions).

          I think I wasn't as explicit as I should have been.

          • Sorting and casing is a function of the collation set, not the character set. (See the sketch after this list.)
          • Let the users of a given collation set determine how stuff should sort, case, and all that language-specific crap. Chinese speakers couldn't care less about A vs a vs ae vs a-acute vs whatever. Just as English speakers couldn't care less about diacritics. Don't force them to care.
          • Speed as an issue shouldn't get in the way of good design. Speed is a problem for engineers, not designers. And, as Moore's law keeps moving, it's not really a problem anymore.
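
          Here's that sketch (untested; the collation weights are invented):

              use utf8;
              binmode STDOUT, ':encoding(UTF-8)';

              # One collation set's opinion of the order.  Swap the weights
              # and the same characters sort differently -- the character
              # set never has to change.
              my %weight = ( 'i' => 1, 'í' => 2, 'a' => 3 );

              my @sorted = sort { $weight{$a} <=> $weight{$b} } ('a', 'í', 'i');
              print "@sorted\n";   # i í a
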
        • How do you store the glyphs? . . .

          That's a font issue, not an encoding issue. A font would map to a given subset or collation set. Or, if you want, you could map your font to the master character set. It would be up to the application to figure out what to do with characters that don't appear in the font. There are English fonts that don't have glyphs for all the characters in English. (Symbol is a good example.)

          And, this isn't a radical departure. Fonts are mapped to character sets right now, but the work is done by the display library code. I'm proposing that the font would contain the necessary metadata to describe either the subset/collation set or a subset definition that the font represents. Again, this is the backwards-compatible thing going on. Fonts, historically, have been the province of the application. Later, fontmaps were created, but the application still maintained control over how to interpret the fontmap. Instead, why doesn't the application ask the fontmap if it knows how to represent character #234211 and, if so, would it please return the appropriate glyph?
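
          Something like this, say (untested; the package and method names are invented, not any existing API):

              package FontMap;

              sub new {
                  my ($class, %glyphs) = @_;
                  return bless { glyphs => \%glyphs }, $class;
              }

              sub has_glyph { my ($self, $cp) = @_; return exists $self->{glyphs}{$cp} }
              sub glyph_for { my ($self, $cp) = @_; return $self->{glyphs}{$cp} }

              package main;

              my $font = FontMap->new( 234_211 => '<bitmap data for #234211>' );

              my $cp = 234_211;
              if ( $font->has_glyph($cp) ) {
                  print "render: ", $font->glyph_for($cp), "\n";
              }
              else {
                  print "the application decides what to do about missing #$cp\n";
              }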

        • What do you do about old code that does things like tr/A-Z/a-z/;? ("Ignore it" is often a good answer.)

          The better answer is to have a collation set that represents 7-bit ASCII. There is no need to have collation sets be the same size as each other. A Chinese character set would probably run to at least 30k characters. An English character set could run as small as 40 or 50, if you ignore case. Some languages might be able to get away with even less.
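
          For instance (untested, and every number is invented), the ASCII collation set could carry its own case-folding table keyed by master-list numbers:

              # Case folding lives in the collation set, not the character set.
              my %ascii_to_lower = (
                  234    => 777_001,   # hypothetical 'A' => 'a'
                  12_312 => 777_002,   # hypothetical 'B' => 'b'
                  # ... 24 more pairs ...
              );

              # Old tr/A-Z/a-z/ habits, re-expressed against the master list:
              sub fold_lower {
                  return map { exists $ascii_to_lower{$_} ? $ascii_to_lower{$_} : $_ } @_;
              }

              my @folded = fold_lower(234, 12_312, 999_999);
              print "@folded\n";   # 777001 777002 999999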

          If your system is sufficiently general, with appropriate views into it, then there is nothing you cannot emulate. Heck, if one were truly masochistic, one might even develop a collation set that would emulate Unicode. :-)

        ------
        We are the carpenters and bricklayers of the Information Age.

        Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        I shouldn't have to say this, but any code, unless otherwise stated, is untested

      I fail to see the problem ...

      Which, itself, is a problem. That you have it isn't unusual (most people have it), but there is a problem.

      People. People are the problem, or at least a large part of it.

      Any solution that starts off "If only everybody did X completely differently than they do now" is doomed to failure as a universal solution. Won't work, and its 'universality' will die the death of a thousand cuts, some technical (and yes, Unicode has technical problems), some social, and some political. It'll be fought because it's sub-optimal in many cases, because people don't like it, because they resent what they see as someone saying "your solution sucks, and so do you -- mine is better", because universal solutions are universally riddled with design compromises, because... because people are ornery. On both (or all) sides of the issue.

      Anyone who thinks they have a universal solution needs to get their ego in check, since it's visible from orbit. All solutions have flaws. Failing to recognize them doesn't mean they aren't there, and acting as if they don't exist does no one any favors.

        I fail to see the technical problem. There are always problems in adoption of a new technology. VRML is an excellent example of this. It solved a problem, did it well, and died a horrible death.

        I was replying (somewhat facetiously, somewhat seriously) to hardburn who expressed doubt that a full solution would ever be devised. I took that to mean that there were insurmountable technical barriers - that it was NP-complete vs. merely very difficult. So, I proposed a solution-path that would solve the technical problems.

        I am fully aware that the solution I proposed would require a complete rewrite of every single application in existence and how they deal with strings. It, in fact, would require a complete rethinking of how to deal with strings in general. *shrugs* That the solution is undeployable doesn't mean the problem is unsolvable. Solving a problem, imho, requires the application of four different skills:

        1. Figuring out what problem to solve (Identification)
        2. Figuring out a solution to that problem (Theoretical Analysis)
        3. Figuring out how to implement said solution (Engineering)
        4. Figuring out how to deploy and encourage adoption of said solution (Politics)

        The problem has been identified for years. I proposed a solution and a high-level implementation. I have no idea how to go about encouraging deployment and adoption. It's a bootstrapping problem, as far as I can tell.

        ------
        We are the carpenters and bricklayers of the Information Age.

        Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

        I shouldn't have to say this, but any code, unless otherwise stated, is untested