in reply to Encoding is a pain.
Dan Sugalski has some excellent, practical advice on encoding strings: http://www.sidhe.org/~dan/blog/archives/000256.html.
And no, Unicode is not the ultimate answer. In Dan's own words, it's a partial solution to a problem you may not even have. It's certainly useful and would make things easier if everyone used it, but the problem of encoding all human written communication is such a big one that I doubt there will ever be a full solution.
"There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.
Replies are listed 'Best First'.
Re^2: Encoding is a pain.
by dragonchild (Archbishop) on Sep 20, 2004 at 14:49 UTC
For shame, hardburn! Doubting the human resolve like that. :-)

Seriously ... I think that we as programmers have been ill-served by the "goal" of backward compatibility, especially as it pertains to Unicode. There is a good solution to encoding all human written communication: Ignore the ideas of:

Every character is listed, even if it's a billion characters. If you create a new character, add it at the end and update the appropriate language-specific subsets and collation sets.

If, like in some Asian languages, you can take two characters and combine them, have a combination character. We do the same thing in English with the correct spelling of "aether". If you need to, have a combine-2, combine-3, etc.

Then, you can have language-specific subsets (like ASCII, Latin-X, *-JIS, etc.) that refer to that master list. So, ASCII might still be the 128 characters we know and love, but they refer to 234, 12312, 5832, etc.

Sorting would be handled by the fact that your collation set DWIMs. And, each language subset can have a default collation set, just like Oracle does it.

I fail to see the problem ...

------
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

I shouldn't have to say this, but any code, unless otherwise stated, is untested
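For readers who want to see the shape of the proposal above, here is a minimal sketch in Python. It is not part of the original post. The IDs 234, 12312 and 5832 are the placeholder numbers from the post, and every name in it (MASTER, ASCII_SUBSET, ASCII_COLLATION, decode, render, collate) is hypothetical, made up only to illustrate a master list, a language subset that points into it, and a collation set that is independent of the ID values.

```python
# Hypothetical master list: every character gets one global ID.
# The IDs are arbitrary placeholders, not real code points.
MASTER = {234: "a", 12312: "b", 5832: "c"}

# A language-specific subset maps its own small codes onto master IDs.
ASCII_SUBSET = {0x61: 234, 0x62: 12312, 0x63: 5832}

# A collation set ranks master IDs; the rank has nothing to do with the ID value.
ASCII_COLLATION = {234: 0, 12312: 1, 5832: 2}


def decode(codes, subset):
    """Translate subset-local codes into a list of master IDs."""
    return [subset[c] for c in codes]


def render(ids):
    """Turn a list of master IDs back into displayable text."""
    return "".join(MASTER[i] for i in ids)


def collate(strings, collation):
    """Sort sequences of master IDs by their collation ranks."""
    return sorted(strings, key=lambda ids: [collation[i] for i in ids])


words = [decode(raw, ASCII_SUBSET) for raw in (b"cab", b"abc", b"bca")]
print([render(w) for w in collate(words, ASCII_COLLATION)])
```

Note that sorting by raw master ID would put "a" (234) before "c" (5832) before "b" (12312), which is why the collation table has to be a separate piece of data. The sketch says nothing about combining characters or about how such tables would be maintained, which is where the replies below pick up.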
by hardburn (Abbot) on Sep 20, 2004 at 15:18 UTC
That's a start on what you need. As usual, the Devil is in the Details. Even if you solve the problem of encoding all human written communication into a bit stream, you still need to process that data. A few things you need to cover are:

Also, IIRC, even with combining chars, 64 bits still isn't enough to cover all human language.

I hypothesize that computers would never have become so ubiquitous in people's lives if they had first been developed in a region with a complex alphabet. The problems of developing the user interface would have been so big that I bet most people would have given up and said "well, let's just use them to crunch numbers". We might be fortunate that computers were primarily developed in a region with a relatively simple alphabet.
by dragonchild (Archbishop) on Sep 20, 2004 at 15:34 UTC
by Elian (Parson) on Sep 20, 2004 at 19:09 UTC
"I fail to see the problem ..."

Which, itself, is a problem. That you have it isn't unusual (most people have it) but there is a problem.

People.

People are the problem, or at least a large part of it. Any solution that starts off "If only everybody did X completely differently than they do now" is doomed to failure as a universal solution. It won't work, and its 'universality' will die the death of a thousand cuts, some technical (and yes, Unicode has technical problems), some social, and some political. It'll be fought because it's sub-optimal in many cases, because people don't like it, because they resent what they see as someone saying "your solution sucks, and so do you -- mine is better", because universal solutions are universally riddled with design compromises, because... because people are ornery. On both (or all) sides of the issue.

Anyone who thinks they have a universal solution needs to get their ego in check, since it's visible from orbit. All solutions have flaws. Failing to recognize them doesn't mean they aren't there, and acting as if they don't exist does no one any favors.
by dragonchild (Archbishop) on Sep 20, 2004 at 19:48 UTC
I was replying (somewhat facetiously, somewhat seriously) to hardburn, who expressed doubt that a full solution would ever be devised. I took that to mean that there were insurmountable technical barriers - that it was NP-complete vs. merely very difficult. So, I proposed a solution-path that would solve the technical problems.

I am fully aware that the solution I proposed would require a complete rewrite of every single application in existence and how they deal with strings. It, in fact, would require a complete rethinking of how to deal with strings in general. *shrugs* That the solution is undeployable doesn't mean the problem is unsolvable.

Solving a problem, imho, requires the application of four different skills:

The problem has been identified for years. I proposed a solution and a high-level implementation. I have no idea how to go about encouraging deployment and adoption. It's a bootstrapping problem, as far as I can tell.
by Aristotle (Chancellor) on Sep 20, 2004 at 22:26 UTC
by dragonchild (Archbishop) on Sep 21, 2004 at 00:34 UTC
by Ytrew (Pilgrim) on Sep 21, 2004 at 15:43 UTC