Honestly, the world would be in many small ways better if everyone used 4-byte Unicode, but here we are in 2004: my terminal is a Shift_JIS terminal, and I have documents in UTF-8, Latin-1 and Shift_JIS, and probably a few more encodings too.

Now these documents, you understand, are XML. Well then, what is there to worry about? XML was written with multiple encodings in mind; all you have to do is put in the XML declaration and there will be happiness in the world of interoperable data formats.

Also, I have perl 5.8.5. Well then, what is there to worry about? Perl 5.8 has the Encode module and the encoding pragma. Localized variants like jperl become redundant. And there was much rejoicing.

But then we get into difficulty. I blithely said that my terminal was Shift_JIS, quietly ignoring the fact that nobody knows what Shift_JIS actually is. The XML/Expat devs got so mad at this that they just replaced support for Shift_JIS with four private Shift_JIS encodings and a message saying "This is a mess, you sort it out."

Things are nearly as bad over on planet Unix, where there are two incompatible EUC-JP encodings.

Well, OK, let's try one of these private encodings. . . Ah, they don't encode the "long swung dash" character or the "TEL" character. That may be "correct", but it's not very helpful.

OK, damn the support for encodings in the XML parser. I have 5.8.5 (and I don't care who knows it). I can decode strings from any of these encodings to UTF-8 and encode them back again. Ah, but there are pits to fall into here too.

First, the encoding pragma sets the output encoding for the script, not for the modules that the script uses:

use encoding 'shiftjis';
use XML::Parser;
my $p = new XML::Parser(Style => 'Debug');

The output from the parser uses the :raw layer, not Shift_JIS. Result: 1001 nonsense kanji fill my screen.

Moreover, unless you can control the ProtocolEncoding, and not all modules built on XML::Parser give you that control (think XML::RSS), you're stuck: you can decode your Shift_JIS file to UTF-8, but the XML declaration will still say "Shift_JIS", and the parser won't know what to do, because it has never heard of Shift_JIS, and even if it had, the UTF-8 document you are feeding it certainly isn't valid Shift_JIS XML. You're going to have to start munging the XML declaration to get it to work.
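
By "munging" I mean something like this minimal sketch (the file name and the exact declaration pattern here are my own assumptions):

use Encode qw(decode encode);

open my $fh, '<:raw', 'doc.xml' or die $!;   # read the raw bytes
my $xml = do { local $/; <$fh> };
$xml = decode('shiftjis', $xml);             # bytes -> characters
$xml =~ s/encoding=(["'])Shift_JIS\1/encoding=${1}UTF-8$1/i;   # correct the label
my $utf8 = encode('utf-8', $xml);            # characters -> UTF-8 bytes
# $utf8 is now an honest UTF-8 document the parser can swallow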

And all this is because the two encoding handlers choose to have a battle through your code, and you are left trying to keep them apart.

The world would be so much better if everyone used 4-byte unicode.

Re: Encoding is a pain.
by hardburn (Abbot) on Sep 20, 2004 at 14:17 UTC

    Dan Sugalski has some excellent, practical advice on encoding strings: http://www.sidhe.org/~dan/blog/archives/000256.html.

    And no, Unicode is not the ultimate answer. In Dan's own words, it's a partial solution to a problem you may not even have. It's certainly useful and would make things easier if everyone used it, but the problem of encoding all human written communication is such a big one that I doubt there will ever be a full solution.

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      . . . the problem of encoding all human written communication is such a big one that I doubt there will ever be a full solution.

      For shame, hardburn! Doubting the human resolve like that. :-)

      Seriously ... I think that we as programmers have been ill-served by the "goal" of backwards-compatibility, especially as it pertains to Unicode. There is a good solution to encoding all human written communication:

      1. Gather all possible characters in one place.
      2. Give each one a number from 1 to N, where N is the number of possible characters.
      3. Use M bytes per character to encode this list, where M = ceil(log2(N) / 8).
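
      To make step 3 concrete, here's the arithmetic for the "billion characters" figure mentioned below (a quick sketch, nothing more):

      use POSIX qw(ceil);

      my $n = 1_000_000_000;                # N: a billion characters
      my $m = ceil(log($n) / log(2) / 8);   # bits needed, packed into whole bytes
      print "$m bytes per character\n";     # prints 4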

      Ignore the ideas of:

      • keeping specific languages together
      • keeping different languages apart
      • ordering this list for ease of sorting specific languages
      • phonetic vs. written vs. any crap.

      Every character is listed, even if it's a billion characters. If you create a new character, add it at the end and update the appropriate language-specific subsets and collation sets.

      If, as in some Asian languages, you can take two characters and combine them, have a combination character. We do the same thing in English with the correct spelling of "aether" (the 'æ' ligature). If you need to, have a combine-2, combine-3, etc.

      Then, you can have language-specific subsets (like ASCII, Latin-X, *-JIS, etc.) that refer to that master list. So, ASCII might still be the 128 characters we know and love, but they refer to 234, 12312, 5832, etc.

      Sorting would be handled by the fact that your collation set DWIMs. And, each language subset can have a default collation set, just like Oracle does it.

      I fail to see the problem ...

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

        That's a start on what you need. As usual, the Devil is in the Details. Even if you solve the problem of encoding all human written communication into a bit stream, you still need to process that data. A few things you need to cover are:

        • How do you do basic text transformations, like lc and uc? (Does the language in question even have a concept of upper/lowercase chars?) Should 'í' sort before or after 'i'? (These get harder to implement when you consider combining chars--which is why the Unicode-enabled perl 5.8 runs so much slower than earlier versions.)
        • How do you store the glyphs? Assuming a 32-bit char set and each glyph taking up a 16x16 bit matrix (black-and-white, no anti-aliasing or other fancy stuff), you would need 2**32 glyphs * 32 bytes each = 128 GiB to store all possible glyphs (checked in the sketch below). You need a way to break this up so that systems only have to deal with the subset of the data that the typical user cares about. (And I'm not sure that a 16x16 grid would be big enough for many characters.)
        • What do you do about old code that does things like tr/A-Z/a-z/;? ("Ignore it" is often a good answer.)
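
        A quick check of the storage estimate above:

        my $glyphs = 2 ** 32;           # one glyph per code point
        my $bytes  = 16 * 16 / 8;       # 16x16 1-bit pixels = 32 bytes per glyph
        printf "%.0f GiB\n", $glyphs * $bytes / 2 ** 30;   # prints 128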

        Also, IIRC, even with combining chars, 64 bits still isn't enough to cover all human language.

        I hypothesize that computers would never have become so ubiquitous in people's lives if they had first been developed in a region with a complex alphabet. The problems of developing the user interface would have been so big that I bet most people would have given up and said "well, let's just use them to crunch numbers". We might be fortunate that computers were primarily developed in a region with a relatively simple alphabet.

        "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

        I fail to see the problem ...
        Which, itself, is a problem. That you have it isn't unusual (most people have it) but there is a problem.

        People. People are the problem, or at least a large part of it.

        Any solution that starts off "If only everybody did X completely differently than they do now" is doomed to failure as a universal solution. Won't work, and its 'universality' will die the death of a thousand cuts, some technical (and yes, Unicode has technical problems), some social, and some political. It'll be fought because it's sub-optimal in many cases, because people don't like it, because they resent what they see as someone saying "your solution sucks, and so do you -- mine is better", because universal solutions are universally riddled with design compromises, because... because people are ornery. On both (or all) sides of the issue.

        Anyone who thinks they have a universal solution needs to get their ego in check, since it's visible from orbit. All solutions have flaws. Failing to recognize them doesn't mean they aren't there, and acting as if they don't exist does no one any favors.

Re: Encoding is a pain.
by demerphq (Chancellor) on Sep 20, 2004 at 14:14 UTC

    Personally I view this problem a little bit differently: why doesn't XML have a way to handle arbitrary binary data? It seems there is no way to use XML to carry generic binary data. A good example is the XML tickers here: there are characters possible in a node and other places that cannot be validly embedded in XML. This means that unless we encode all node content as hex or something like it, we cannot be sure that we will return valid XML. Since we don't want to do this, we have the problem that it's relatively easy to embed characters in a node that will break many of the XML parsers that consume data from our tickers. I see your language encoding issue as just a variant of this problem. Maybe that's wrong, but that's the way it feels to me.


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi



      Why doesn't XML have a way to handle arbitrary binary data? ... A good example are the XML tickers here, there are characters possible in a node and other places that cannot be validly embedded in XML.

      Make up your mind, are they characters or binary data? :-)

      Certainly any character which can be represented in HTML should be representable in XML. For example, in HTML you could use &eacute; for 'é'. In XML, you don't have the handy mnemonic name unless you use a DTD, but you can still represent the character as &#233; - 'é'. The HTML::Entities module can help with the conversion.
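
      For instance (a small sketch; the exact form of the numeric output may vary by module version):

      use HTML::Entities qw(encode_entities encode_entities_numeric);

      my $text = "caf\x{E9}";                      # 'café'
      print encode_entities($text), "\n";          # caf&eacute;  (HTML mnemonic form)
      print encode_entities_numeric($text), "\n";  # caf&#xE9;    (fine for XML, no DTD)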

        The character versus data distinction is important. XML does have a way to express non-ASCII characters using the DTD as noted. For true binary data, CDATA sections almost do it, but they're not foolproof, since the binary data could happen to contain the ']]>' sequence and make the section look like it ended before it really did. But you could encode the data using an agreed-upon scheme, such as uuencode or base64, and put that in CDATA tags. Ugly, but possible.
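
        A sketch of the base64 route (the element name and layout are invented for illustration):

        use MIME::Base64 qw(encode_base64 decode_base64);

        my $blob = join '', map { chr } 0 .. 255;          # every possible byte value
        my $b64  = encode_base64($blob);                   # output is only [A-Za-z0-9+/=\n]
        my $xml  = qq{<blob enc="base64">\n$b64</blob>};   # ']]>' can never appear in it
        my ($body) = $xml =~ m{<blob enc="base64">\n(.*)</blob>}s;
        print "round trip ok\n" if decode_base64($body) eq $blob;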

      Why doesn't XML have a way to handle arbitrary binary data?

      Well, most likely because some heavyweight extra machinery would be needed to take care of the nonzero probability that "arbitrary binary data" might, just by coincidence, contain a byte sequence that starts with 0x3C '<', ends with 0x3E '>', and has just alphanumerics (and perhaps an unfortunately well-placed slash character) in between.

      Sure, there are bound to be ways to do this, but I think the vast majority of XML users really don't want to go there (not least because of what it might do when passed through various network transfer protocols). (update: e.g. how would you "fix" the ubiquitous "crlf/dos-text-mode" transfer methods to handle "arbitrary binary content in XML"? This is tricky enough already just with UTF-16.)

Re: Encoding is a pain.
by dragonchild (Archbishop) on Sep 20, 2004 at 14:08 UTC
    Other than requiring everyone to use 4-byte unicode (which I agree would make life a lot easier for us grunts!) ... what possible solutions do you have in mind?

    For example, you complain that the output from the parser uses the :raw layer.

    • How would you have the encoding pragma propagate appropriately? Maybe it should be a lexically scoped pragma. (Think strict and warnings.)
    • Can you pass an $fh in to XML::Parser? If you can, is its IO layer set correctly if you open it in your script? (See the sketch after this list.)
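
    You can, for what it's worth; whether the layer then fights the XML declaration is exactly the OP's problem. A sketch (and per my sig below, untested; the file name is made up):

    use XML::Parser;

    open my $fh, '<:encoding(shiftjis)', 'doc.xml' or die $!;
    my $p = XML::Parser->new(Style => 'Debug');
    $p->parse($fh);   # parse() accepts an open handle as well as a string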

    I'm not sure what the right solution is in 5.8.x, let alone in modules (like PDF::Template which uses XML::Parser) that have to support 5.005, 5.6.x, and 5.8.x. (I have nightmares about this, frankly, especially because I speak only Latin-1 languages.)

    You might be interested to see the plethora of discussions that the parrot-dev and perl6-language lists have been having about this. If they are still having issues after working on the problem for over a year now, it's amazing that a bunch of modules that were ad-hoc'ed together work at all, let alone as well as they do!

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Encoding is a pain.
by ambrus (Abbot) on Sep 20, 2004 at 17:33 UTC

    One can always use explicit I/O layers on filehandles, like

    binmode STDIN, ":encoding(iso-8859-2)";
    instead of a use encoding. As far as I can see, the real use of use encoding is when you want to embed encoded characters in the script itself, not when you want the script to handle encoded input or output.

    I am, however, not really familiar with encodings, so I'm not sure.
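
    Something like this is what I mean, if I understand the pragma right (a sketch; note that, as pointed out below, the pragma also flips STDIN and STDOUT):

    use encoding 'iso-8859-2';   # the script file itself is saved as Latin-2

    my $city = "Gödöllő";        # Latin-2 bytes on disk, characters at runtime
    print uc($city), "\n";       # uc() now treats them as letters: GÖDÖLLŐ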

Re: Encoding is a pain.
by theorbtwo (Prior) on Sep 20, 2004 at 18:26 UTC

    You seem to have two major problems. One of them is a perl problem, one is not.

    The perl problem first: encoding doesn't specify anything about what I/O encoding the script should use. It only specifies what encoding the script itself is in. Err... I was wrong here. encoding does set the encodings of STDIN and STDOUT, but as it says: "Note that STDERR WILL NOT be changed." (under USAGE). I assume the debugging output of XML::Parser goes to STDERR.
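
    If that's the case, the fix would presumably be just setting the layer by hand (untested, as per the warning below):

    binmode STDERR, ':encoding(shiftjis)';   # the pragma covers STDIN/STDOUT, not STDERR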

    Your second problem is, as you've diagnosed, that Shift_JIS is under-specified, and possibly mis-specified as well. So, why is it that your terminal uses Shift_JIS and not utf8?


    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      I assume the debugging output of XML::Parser goes to STDERR.

      This is where good assumptions beat rtfm(*). The parser pod file tells us that the Debug style "prints out the document in outline form". Sounds like STDOUT to me, but the source tells a different story.

      Anyway, I now have a Deout.pm: the same as Debug.pm, but printing to STDOUT, and my screen is less filled with garbage. This is good.
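
      For the record, Deout.pm amounts to roughly this (reconstructed from memory, so treat it as a sketch; the real Debug.pm does a little more, and XML::Parser finds it because Style => 'Deout' maps to the XML::Parser::Deout package):

      package XML::Parser::Deout;

      sub Init  { }
      sub Final { }
      sub Start {
          my ($expat, $tag) = @_;   # attributes follow in @_, ignored here
          print STDOUT '  ' x $expat->depth, "@ $tag\n";
      }
      sub End   { }
      sub Char  {
          my ($expat, $text) = @_;
          print STDOUT '  ' x $expat->depth, "| $text\n";
      }

      1;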

      * This, of course, only applies to good assumptions (or dodgy manuals)

Re: Encoding is a pain.
by DrHyde (Prior) on Sep 22, 2004 at 09:49 UTC
    "I feel your pain".

    The mess that is character encoding pisses me off no end. I solve the problem by trying to ignore it. Until the rest of the world is willing to Do The Right Thing and just use simple 32-bit (or 64-bit) fixed-width characters, I'm not going to go out of my way to accommodate other people's stupidity. If this causes me to fail to read your text, or causes me to emit text that someone else can't read, I DON'T CARE.

    Perhaps I'd be a little more accepting of odd character encodings if the crack-smoking loonies who'd invented them had simultaneously produced working libraries for dealing with them. But for values of "working" that I care about (i.e. it Just Works), they didn't. I have no interest in setting the splindlebibbit bit in the garbleflux configuration file; I just want to type emails properly in Anglo-Saxon.

    My hatred for XML is a whole other rant :-)

    Incidentally, the fact that some very clever people can't get it to Just Work (hi perl5 developers!) indicates to me that the design is wrong.