Honestly, the world would be in many small ways better if everyone used 4-byte Unicode, but here we are in 2004: my terminal is a Shift_JIS terminal, and I have documents in UTF-8, Latin-1 and Shift_JIS, and probably a few more encodings too.

Now these documents, you understand, are XML. Well then, what is there to worry about? XML was written with multiple encodings in mind; all you have to do is put in the XML declaration and there will be happiness in the world of interoperable data formats.

Also, I have perl 5.8.5. Well then, what is there to worry about? Perl 5.8 has the Encode module and the encoding pragma. Localized variants like jperl become redundant. And there was much rejoicing.

But then we get into difficulty. I blithely said that my terminal was Shift_JIS, quietly ignoring the fact that nobody knows what Shift_JIS actually is. The XML/Expat devs got so mad at this that they just replaced support for Shift_JIS with four private Shift_JIS encodings and a message saying "This is a mess, you sort it out."

Things are nearly as bad over on planet Unix, where there are two incompatible EUC-JP encodings.

Well, OK, let's try one of these private encodings. . . Ah, they don't encode the "long swung dash" character or the "TEL" character. That may be "correct", but it's not very helpful.

OK, damn the support for encodings in the XML parser. I have 5.8.5 (and I don't care who knows it). I can decode strings from any of these encodings to UTF-8 and encode them back again. Ah, but there are pits to fall into here too.

First, the encoding pragma sets the output encoding for the script, not for the modules that the script uses:

use encoding 'shiftjis';
use XML::Parser;
my $p = new XML::Parser(Style => 'Debug');

The output from the parser uses the :raw layer, not Shift_JIS. Result: 1001 nonsense kanji fill my screen.

Moreover, unless you can control the ProtocolEncoding, and not all modules built on XML::Parser give you that control (think XML::RSS), you're stuck: you can decode your Shift_JIS file to UTF-8, but the XML declaration will still say "Shift_JIS", and the parser won't know what to do, because it has never heard of Shift_JIS, and even if it had, the UTF-8 document you are feeding it certainly isn't valid Shift_JIS XML. You're going to have to start munging the XML declaration to get it to work.
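
By "munging" I mean something like this minimal sketch (the file name and the exact declaration pattern here are my own assumptions):

use Encode qw(decode encode);

open my $fh, '<:raw', 'doc.xml' or die $!;   # read the raw bytes
my $xml = do { local $/; <$fh> };
$xml = decode('shiftjis', $xml);             # bytes -> characters
$xml =~ s/encoding=(["'])Shift_JIS\1/encoding=${1}UTF-8$1/i;   # correct the label
my $utf8 = encode('utf-8', $xml);            # characters -> UTF-8 bytes
# $utf8 is now an honest UTF-8 document the parser can swallow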

And all this is because the two encoding handlers choose to have a battle through your code, and you are left trying to keep them apart.

The world would be so much better if everyone used 4-byte unicode.

Re: Encoding is a pain.
by hardburn (Abbot) on Sep 20, 2004 at 14:17 UTC

    Dan Sugalski has some excellent, practical advice on encoding strings: http://www.sidhe.org/~dan/blog/archives/000256.html.

    And no, Unicode is not the ultimate answer. In Dan's own words, it's a partial solution to a problem you may not even have. It's certainly useful and would make things easier if everyone used it, but the problem of encoding all human written communication is such a big one that I doubt there will ever be a full solution.

    "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

      . . . the problem of encoding all human written communication is such a big one that I doubt there will ever be a full solution.

      For shame, hardburn! Doubting the human resolve like that. :-)

      Seriously ... I think that we as programmers have been ill-served by the "goal" of backwards-compatibility, especially as it pertains to Unicode. There is a good solution to encoding all human written communication:

      1. Gather all possible characters in one place.
      2. Give each one a number from 1 to N, where N is the number of possible characters.
      3. Use M bytes per character to encode this list, where M = ceil(log2(N) / 8).
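
      To make step 3 concrete, here's the arithmetic for the "billion characters" figure mentioned below (a quick sketch, nothing more):

      use POSIX qw(ceil);

      my $n = 1_000_000_000;                # N: a billion characters
      my $m = ceil(log($n) / log(2) / 8);   # bits needed, packed into whole bytes
      print "$m bytes per character\n";     # prints 4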

      Ignore the ideas of:

      • keeping specific languages together
      • keeping different languages apart
      • ordering this list for ease of sorting specific languages
      • phonetic vs. written vs. any crap.

      Every character is listed, even if it's a billion characters. If you create a new character, add it at the end and update the appropriate language-specific subsets and collation sets.

      If, as in some Asian languages, you can take two characters and combine them, have a combination character. We do the same thing in English with the correct spelling of "aether" (the 'æ' ligature). If you need to, have a combine-2, combine-3, etc.

      Then, you can have language-specific subsets (like ASCII, Latin-X, *-JIS, etc.) that refer to that master list. So, ASCII might still be the 128 characters we know and love, but they refer to 234, 12312, 5832, etc.

      Sorting would be handled by the fact that your collation set DWIMs. And, each language subset can have a default collation set, just like Oracle does it.

      I fail to see the problem ...

      ------
      We are the carpenters and bricklayers of the Information Age.

      Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

      I shouldn't have to say this, but any code, unless otherwise stated, is untested

        That's a start on what you need. As usual, the Devil is in the Details. Even if you solve the problem of encoding all human written communication into a bit stream, you still need to process that data. A few things you need to cover are:

        • How do you do basic text transformations, like lc and uc? (Does the language in question even have a concept of upper/lowercase chars?) Should 'í' sort before or after 'i'? (These get harder to implement when you consider combining chars--which is why the Unicode-enabled perl 5.8 runs so much slower than earlier versions.)
        • How do you store the glyphs? Assuming a 32-bit char set and each glyph taking up a 16x16 bit matrix (black-and-white, no anti-aliasing or other fancy stuff), you would need 2**32 glyphs * 32 bytes each = 128 GiB to store all possible glyphs (checked in the sketch below). You need a way to break this up so that systems only have to deal with the subset of the data that the typical user cares about. (And I'm not sure that a 16x16 grid would be big enough for many characters.)
        • What do you do about old code that does things like tr/A-Z/a-z/;? ("Ignore it" is often a good answer.)
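
        A quick check of the storage estimate above:

        my $glyphs = 2 ** 32;           # one glyph per code point
        my $bytes  = 16 * 16 / 8;       # 16x16 1-bit pixels = 32 bytes per glyph
        printf "%.0f GiB\n", $glyphs * $bytes / 2 ** 30;   # prints 128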

        Also, IIRC, even with combining chars, 64 bits still isn't enough to cover all human language.

        I hypothesize that computers would never have become so ubiquitous in people's lives if they had first been developed in a region with a complex alphabet. The problems of developing the user interface would have been so big that I bet most people would have given up and said "well, let's just use them to crunch numbers". We might be fortunate that computers were primarily developed in a region with a relatively simple alphabet.

        "There is no shame in being self-taught, only in not trying to learn in the first place." -- Atrus, Myst: The Book of D'ni.

        I fail to see the problem ...
        Which, itself, is a problem. That you have it isn't unusual (most people have it) but there is a problem.

        People. People are the problem, or at least a large part of it.

        Any solution that starts off "If only everybody did X completely differently than they do now" is doomed to failure as a universal solution. Won't work, and its 'universality' will die the death of a thousand cuts, some technical (and yes, Unicode has technical problems), some social, and some political. It'll be fought because it's sub-optimal in many cases, because people don't like it, because they resent what they see as someone saying "your solution sucks, and so do you -- mine is better", because universal solutions are universally riddled with design compromises, because... because people are ornery. On both (or all) sides of the issue.

        Anyone who thinks they have a universal solution needs to get their ego in check, since it's visible from orbit. All solutions have flaws. Failing to recognize them doesn't mean they aren't there, and acting as if they don't exist does no one any favors.

Re: Encoding is a pain.
by demerphq (Chancellor) on Sep 20, 2004 at 14:14 UTC

    Personally I view this problem a little bit differently: why doesn't XML have a way to handle arbitrary binary data? It seems there is no way to use XML to carry generic binary data. A good example is the XML tickers here: there are characters possible in a node and other places that cannot be validly embedded in XML. This means that unless we encode all node content as hex or something like it, we cannot be sure that we will return valid XML. Since we don't want to do this, we have the problem that it's relatively easy to embed characters in a node that will break many of the XML parsers that consume data from our tickers. I see your language encoding issue as just a variant of this problem. Maybe that's wrong, but that's the way it feels to me.


    ---
    demerphq

      First they ignore you, then they laugh at you, then they fight you, then you win.
      -- Gandhi



      Why doesn't XML have a way to handle arbitrary binary data? ... A good example are the XML tickers here, there are characters possible in a node and other places that cannot be validly embedded in XML.

      Make up your mind, are they characters or binary data? :-)

      Certainly any character which can be represented in HTML should be representable in XML. For example, in HTML you could use &eacute; for 'é'. In XML, you don't have the handy mnemonic name unless you use a DTD, but you can still represent the character as &#233; - 'é'. The HTML::Entities module can help with the conversion.
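
      For instance (a small sketch; the exact form of the numeric output may vary by module version):

      use HTML::Entities qw(encode_entities encode_entities_numeric);

      my $text = "caf\x{E9}";                      # 'café'
      print encode_entities($text), "\n";          # caf&eacute;  (HTML mnemonic form)
      print encode_entities_numeric($text), "\n";  # caf&#xE9;    (fine for XML, no DTD)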

        The character versus data distinction is important. XML does have a way to express non-ASCII characters using the DTD as noted. For true binary data, CDATA sections almost do it, but they're not foolproof, since the binary data could happen to contain the ']]>' sequence and make the section look like it ended before it really did. But you could encode the data using an agreed-upon scheme, such as uuencode or base64, and put that in CDATA tags. Ugly, but possible.
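
        A sketch of the base64 route (the element name and layout are invented for illustration):

        use MIME::Base64 qw(encode_base64 decode_base64);

        my $blob = join '', map { chr } 0 .. 255;          # every possible byte value
        my $b64  = encode_base64($blob);                   # output is only [A-Za-z0-9+/=\n]
        my $xml  = qq{<blob enc="base64">\n$b64</blob>};   # ']]>' can never appear in it
        my ($body) = $xml =~ m{<blob enc="base64">\n(.*)</blob>}s;
        print "round trip ok\n" if decode_base64($body) eq $blob;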

      Why doesn't XML have a way to handle arbitrary binary data?

      Well, most likely because some heavyweight extra machinery would be needed to take care of the nonzero probability that "arbitrary binary data" might, just by coincidence, contain a byte sequence that starts with 0x3C '<', ends with 0x3E '>', and has just alphanumerics (and perhaps an unfortunately well-placed slash character) in between.

      Sure, there are bound to be ways to do this, but I think the vast majority of XML users really don't want to go there (not least because of what it might do when passed through various network transfer protocols). (update: e.g. how would you "fix" the ubiquitous "crlf/dos-text-mode" transfer methods to handle "arbitrary binary content in XML"? This is tricky enough already just with UTF-16.)

Re: Encoding is a pain.
by dragonchild (Archbishop) on Sep 20, 2004 at 14:08 UTC
    Other than requiring everyone to use 4-byte unicode (which I agree would make life a lot easier for us grunts!) ... what possible solutions do you have in mind?

    For example, you complain that the output from the parser uses the :raw layer.

    • How would you have the encoding pragma propagate appropriately? Maybe it should be a lexically scoped pragma. (Think strict and warnings.)
    • Can you pass an $fh in to XML::Parser? If you can, is its IO layer set correctly if you open it in your script? (See the sketch after this list.)
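
    You can, for what it's worth; whether the layer then fights the XML declaration is exactly the OP's problem. A sketch (and per my sig below, untested; the file name is made up):

    use XML::Parser;

    open my $fh, '<:encoding(shiftjis)', 'doc.xml' or die $!;
    my $p = XML::Parser->new(Style => 'Debug');
    $p->parse($fh);   # parse() accepts an open handle as well as a string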

    I'm not sure what the right solution is in 5.8.x, let alone in modules (like PDF::Template which uses XML::Parser) that have to support 5.005, 5.6.x, and 5.8.x. (I have nightmares about this, frankly, especially because I speak only Latin-1 languages.)

    You might be interested to see the plethora of discussions that the parrot-dev and perl6-language lists have been having about this. If they are still having issues after working on the problem for over a year now, it's amazing that a bunch of modules that were ad-hoc'ed together work at all, let alone as well as they do!

    ------
    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

Re: Encoding is a pain.
by ambrus (Abbot) on Sep 20, 2004 at 17:33 UTC

    One can always use explicit I/O layers on filehandles, like

    binmode STDIN, ":encoding(iso-8859-2)";
    instead of a use encoding. As far as I can see, the real use of use encoding is when you want to embed encoded characters in the script itself, not when you want the script to handle encoded input or output.

    I am, however, not really familiar with encodings, so I'm not sure.
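
    Something like this is what I mean, if I understand the pragma right (a sketch; note that, as pointed out below, the pragma also flips STDIN and STDOUT):

    use encoding 'iso-8859-2';   # the script file itself is saved as Latin-2

    my $city = "Gödöllő";        # Latin-2 bytes on disk, characters at runtime
    print uc($city), "\n";       # uc() now treats them as letters: GÖDÖLLŐ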

Re: Encoding is a pain.
by theorbtwo (Prior) on Sep 20, 2004 at 18:26 UTC

    You seem to have two major problems. One of them is a perl problem, one is not.

    The perl problem first: encoding doesn't specify anything about what I/O encoding the script should use. It only specifies what encoding the script itself is in. Err... I was wrong here. encoding does set the encodings of STDIN and STDOUT, but as it says: "Note that STDERR WILL NOT be changed." (under USAGE). I assume the debugging output of XML::Parser goes to STDERR.
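
    If that's the case, the fix would presumably be just setting the layer by hand (untested, as per the warning below):

    binmode STDERR, ':encoding(shiftjis)';   # the pragma covers STDIN/STDOUT, not STDERR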

    Your second problem is, as you've diagnosed, that Shift_JIS is under-specified, and possibly mis-specified as well. So, why is it that your terminal uses Shift_JIS and not utf8?


    Warning: Unless otherwise stated, code is untested. Do not use without understanding. Code is posted in the hopes it is useful, but without warranty. All copyrights are relinquished into the public domain unless otherwise stated. I am not an angel. I am capable of error, and err on a fairly regular basis. If I made a mistake, please let me know (such as by replying to this node).

      I assume the debugging output of XML::Parser goes to STDERR.

      This is where good assumptions beat rtfm(*). The parser pod file tells us that the Debug style "prints out the document in outline form". Sounds like STDOUT to me, but the source tells a different story.

      Anyway, I now have a Deout.pm: the same as Debug.pm, but printing to STDOUT, and my screen is less filled with garbage. This is good.
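
      For the record, Deout.pm amounts to roughly this (reconstructed from memory, so treat it as a sketch; the real Debug.pm does a little more, and XML::Parser finds it because Style => 'Deout' maps to the XML::Parser::Deout package):

      package XML::Parser::Deout;

      sub Init  { }
      sub Final { }
      sub Start {
          my ($expat, $tag) = @_;   # attributes follow in @_, ignored here
          print STDOUT '  ' x $expat->depth, "@ $tag\n";
      }
      sub End   { }
      sub Char  {
          my ($expat, $text) = @_;
          print STDOUT '  ' x $expat->depth, "| $text\n";
      }

      1;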

      * This, of course, only applies to good assumptions (or dodgy manuals)

Re: Encoding is a pain.
by DrHyde (Prior) on Sep 22, 2004 at 09:49 UTC
    "I feel your pain".

    The mess that is character encoding pisses me off no end. I solve the problem by trying to ignore it. Until the rest of the world is willing to Do The Right Thing and just use simple 32-bit (or 64-bit) fixed-width characters, I'm not going to go out of my way to accommodate other people's stupidity. If this causes me to fail to read your text, or causes me to emit text that someone else can't read, I DON'T CARE.

    Perhaps I'd be a little more accepting of odd character encodings if the crack-smoking loonies who'd invented them had simultaneously produced working libraries for dealing with them. But for values of "working" that I care about (i.e. it Just Works), they didn't. I have no interest in setting the splindlebibbit bit in the garbleflux configuration file; I just want to type emails properly in Anglo-Saxon.

    My hatred for XML is a whole other rant :-)

    Incidentally, the fact that some very clever people can't get it to Just Work (hi perl5 developers!) indicates to me that the design is wrong.