Honestly, the world would be better in many small ways if everyone used 4-byte Unicode, but here we are in 2004 and my terminal is a Shift_JIS terminal, and I have documents in UTF-8, Latin-1 and Shift_JIS, and probably a few more encodings too.

Now these documents, you understand, are XML. Well then, what is there to worry about? XML was designed with multiple encodings in mind: all you have to do is put in the XML declaration, and there will be happiness in the world of interoperable data formats.

Also, I have Perl 5.8.5. Well then, what is there to worry about? Perl 5.8 has the Encode module and the encoding pragma. Localized variants like jperl become redundant. And there was much rejoicing.

But then we get into difficulty. I blithely said that my terminal was Shift_JIS, quietly ignoring the fact that nobody knows what Shift_JIS actually is. The XML/Expat devs got so mad at this that they just replaced support for Shift_JIS with four private Shift_JIS encodings, and a message saying "This is a mess, you sort it out."

Things are nearly as bad over on planet Unix, where there are two incompatible EUC-JP encodings.

Well, OK, let's try one of these private encodings... Ah, they don't encode the "Long swung dash" character or the "TEL" character. That may be "correct" but it's not very helpful.
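A quick way to see which characters a given encoding can represent is to ask Encode to croak on anything it cannot map. This is only a sketch: `encodable` is a made-up helper, the code points are my guess at the characters meant above (U+301C WAVE DASH or U+FF5E FULLWIDTH TILDE for the swung dash, U+2121 TELEPHONE SIGN for "TEL"), and it probes Encode's own `shiftjis` and `cp932` tables, not the private encodings shipped with the XML parser. The answers depend on the mapping tables in your Encode version.

```perl
use strict;
use warnings;
use Encode qw(encode FB_CROAK);

# Hypothetical helper: true if every character in $text has a mapping
# in $encoding, false if Encode has to give up on any of them.
sub encodable {
    my ($encoding, $text) = @_;
    my $copy = $text;    # encode() may modify its argument when a CHECK is set
    return eval { encode($encoding, $copy, FB_CROAK); 1 } ? 1 : 0;
}

# Assumed code points: WAVE DASH, FULLWIDTH TILDE, TELEPHONE SIGN
for my $cp (0x301C, 0xFF5E, 0x2121) {
    for my $enc (qw(shiftjis cp932)) {
        printf "U+%04X in %-8s: %s\n", $cp, $enc,
            encodable($enc, chr $cp) ? 'ok' : 'not representable';
    }
}
```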

OK, damn the support for encodings in the XML parser. I have 5.8.5 (and I don't care who knows it). I can convert strings from UTF-8 to any encoding and back again. Ah, but there are pits to fall into here too.
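For reference, the round trip through Encode looks roughly like this. It is a minimal sketch: the sample bytes are the word "nihongo" (U+65E5 U+672C U+8A9E) written out as Shift_JIS by hand, and the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Three kanji (U+65E5 U+672C U+8A9E) as Shift_JIS bytes.
my $sjis_bytes = "\x93\xFA\x96\x7B\x8C\xEA";

my $chars      = decode('shiftjis', $sjis_bytes);    # bytes -> Perl characters
my $utf8_bytes = encode('UTF-8', $chars);            # characters -> UTF-8 bytes
my $back       = encode('shiftjis', decode('UTF-8', $utf8_bytes));

printf "%d characters, %d UTF-8 bytes, round trip %s\n",
    length $chars, length $utf8_bytes,
    $back eq $sjis_bytes ? 'ok' : 'FAILED';
```

The point is that `decode` and `encode` only touch strings you hand them explicitly; nothing here changes what a module does with its own filehandles.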

First, the encoding pragma sets the output encoding for the script, not for the modules that the script uses.

    use encoding 'shiftjis';
    #
    use XML::Parser;
    my $p = XML::Parser->new(Style => 'Debug');

The output from the parser uses the :raw layer, not Shift_JIS. Result: 1001 nonsense kanji fill my screen.
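One way around this is to put the encoding on the filehandle itself, so that it applies no matter which package does the printing. The sketch below uses an in-memory handle so the bytes can be inspected; `Some::Module` is a made-up stand-in for a module that prints character strings. For a real terminal the equivalent would be `binmode(STDOUT, ':encoding(shiftjis)')`, and I am assuming you would want the same on whichever handle the module actually writes to (check whether that is STDOUT or STDERR).

```perl
use strict;
use warnings;

# Hypothetical stand-in for a module that prints character strings and
# knows nothing about the caller's encoding.
package Some::Module;
sub report { my ($fh, $text) = @_; print {$fh} $text }

package main;

# In-memory handle with a Shift_JIS layer, so the result can be examined.
# For a real terminal: binmode(STDOUT, ':encoding(shiftjis)');
my $buffer = '';
open my $out, '>:encoding(shiftjis)', \$buffer or die "open: $!";

Some::Module::report($out, "\x{65E5}\x{672C}\x{8A9E}\n");
close $out or die "close: $!";

printf "wrote %d bytes: %s\n", length $buffer,
    join ' ', map { sprintf '%02X', ord } split //, $buffer;
```

This only helps when the module hands over character strings. If it prints already-encoded bytes, the layer will encode them a second time.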

Moreover, unless you can control the ProtocolEncoding (and not all XML::Parser subclasses give you that control, think XML::RSS), you can decode your Shift_JIS file to UTF-8, but the XML declaration will still say "Shift_JIS". The XML parser won't know what to do, because it has never heard of Shift_JIS, and even if it has, the UTF-8 document you are feeding it certainly isn't valid Shift_JIS XML. You're going to have to start munging the XML declaration to get it to work.
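The munging can be sketched like this. `xml_to_utf8` is a made-up helper, and the regex assumes the declaration sits at the very start of the document and is written in the usual `encoding="..."` form; it is not a general XML rewriter.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode $bytes from $from, rewrite the encoding named
# in the XML declaration (if there is one) to UTF-8, and return UTF-8 bytes.
sub xml_to_utf8 {
    my ($bytes, $from) = @_;
    my $xml = decode($from, $bytes);
    $xml =~ s/\A(<\?xml\b[^>]*?\bencoding\s*=\s*)(["'])[^"']*\2/$1${2}UTF-8$2/;
    return encode('UTF-8', $xml);
}

# A tiny Shift_JIS document containing two kanji (U+65E5 U+672C).
my $sjis = qq{<?xml version="1.0" encoding="Shift_JIS"?>\n<doc>\x93\xFA\x96\x7B</doc>\n};
print xml_to_utf8($sjis, 'shiftjis');
```

The result can then be handed to the parser as an ordinary UTF-8 document, with the declaration and the bytes finally agreeing with each other.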

And all this is because the two encoding handlers choose to have a battle through your code, and you are left trying to keep them apart.

The world would be so much better if everyone used 4-byte unicode.


In reply to Encoding is a pain. by zeimusu
