santellij has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

We had been using XML::Writer under Perl 5.6.0 for a while. We are now in the process of upgrading to Perl 5.8.5.

Our source XML documents often contain UTF-8 encoded Unicode characters and we found that XML::Writer is not creating the same characters that it used to. One problem character that we found is — (I understand that there have been significant Unicode changes in recent versions of Perl).

Here is what we get (viewed with `less`) after 'converting it to utf-8' from just printing the XML and what the XML::Writer produces:
  XML output via 'print':      <test><C2><97></test>
  XML output via XML::Writer:  <test><97></test>

The 'solution' we came up with was to add:

  binmode($FILE_HANDLE, ":encoding(utf-8)");

to the Writer.pm just after it picks up the file handle like:

  # Set the output.
  if ($params{'OUTPUT'}) {
      binmode($params{'OUTPUT'}, ":encoding(utf-8)");
  }
  &{$self->{'SETOUTPUT'}}($params{'OUTPUT'});

So, is this a valid fix or is it going to give us additional problems later on? If this is a valid patch can someone explain why?

Thanks,
josh

Replies are listed 'Best First'.
Re: problem with XML::Writer, unicode and Perl 5.6.0 upgrade
by graff (Chancellor) on Sep 29, 2004 at 00:18 UTC
    First, a quick sanity check: why does your xml data contain U0097 (a.k.a. &#151;)? According to the code chart, this is a non-displayable control character, whose name/function is labeled in the chart as "END OF GUARDED AREA". Is that what you intend/expect it to be?

    (If you were expecting it to be some displayable character, then either you have the wrong code point in your data, or else you're saying/pretending it's unicode when in fact it is not. BTW, I notice that 0x97 is used in the MS "CP125*" code pages for "em dash", which is "officially" supposed to transliterate into U2014, which in turn should yield a 3-byte utf8 sequence: E2 80 94.)
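
A minimal sketch of the two readings of byte 0x97 described above, assuming the core Encode module (bundled with Perl 5.8). The helper name `hex_bytes` is made up for illustration. Read as CP1252 the byte is the em dash U2014; read as ISO-8859-1 it is the control character U0097.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $byte = "\x97";

# The same input byte, interpreted under two different source encodings.
my $as_cp1252 = decode('cp1252',     $byte);   # Windows code page reading
my $as_latin1 = decode('iso-8859-1', $byte);   # Latin-1 reading

# Encode each character as UTF-8 and show the resulting bytes.
printf "cp1252:     U+%04X -> %s\n",
    ord($as_cp1252), hex_bytes(encode('UTF-8', $as_cp1252));
printf "iso-8859-1: U+%04X -> %s\n",
    ord($as_latin1), hex_bytes(encode('UTF-8', $as_latin1));
```

The first line should report U+2014 with bytes E2 80 94, the second U+0097 with bytes C2 97, which is the sequence seen in the `less` output.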

    I tried the test script that you posted in a reply above, and it seemed to put a U0097 character -- in utf8 encoding (i.e. as the two-byte sequence C2 97) -- for both "test1" and "test2" elements, in all of its outputs (the "print_out.xml" file, the "out.xml" file, and STDOUT; of course, I had to use a hex dump to actually "see" the character in all cases, since it is not displayable). Does that run contrary to your own findings?

    (I'm running 5.8.1 on darwin. 5.8.5 shouldn't be any different...)

      > Is that what you intend/expect it to be?

      The data is provided by our users (publishers) so I just try to make whatever is given to us display. I don't pretend to know that much about unicode.

      > (If you were expecting it to be some displayable character, then either you
      > have the wrong code point in your data, or else you're saying/pretending
      > it's unicode when in fact it is not. BTW, I notice that 0x97 is used in the MS
      > "CP125*" code pages for "em dash", which is "officially" supposed to
      > transliterate into U2014, which in turn should yield a 3-byte utf8 sequence:
      > E2 80 94.)

      Good point. I think our user did want U2014.

      > I tried the test script that you posted in a reply above, and it seemed to put
      > a U0097 character -- in utf8 encoding (i.e. as the two-byte sequence
      > C2 97) -- for both "test1" and "test2" elements, in all of its outputs (the
      > "print_out.xml" file, the "out.xml" file, and STDOUT; of course, I had to use
      > a hex dump to actually "see" the character in all cases, since it is not
      > displayable). Does that run contrary to your own findings?

      Right, sorry for the lack of info here. To see the "<97>" character from XML::Writer you need to comment out:

      binmode($out_file, ":encoding(utf-8)");

      I added that because that is what I added to my Writer.pm to get the "correct" character (<C2><97>). I'm starting to think that <C2><97> is not correct though.
        Hmm. When I comment out that "binmode" line as you suggest, I see the difference in the out.xml file: both "test1" and "test2" elements in that file contain a single byte, 0x97, while the print_out.xml and STDOUT contain the utf8-like two-byte sequence C2 97 for both elements. I think what this points out more than anything is Perl 5.8's ambiguous (or perhaps slightly schizoid) treatment of characters in the range 0x80 - 0xff; I still haven't probed all the subtleties involved there.
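
One way to see the 0x80 - 0xff behaviour in isolation, as a rough sketch rather than an explanation of what XML::Writer does internally: print the same one-character string to a handle with and without an encoding layer. The in-memory filehandles and the helper name `bytes_written` are assumptions made for the demonstration. `utf8::upgrade` is used only to mimic a string carrying Perl's internal utf8 flag, as strings returned by XML::Parser normally do.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle opened with
# the given layer and return the bytes that arrived, as hex.
sub bytes_written {
    my ($layer, $text) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "open failed: $!";
    print {$fh} $text;
    close($fh) or die "close failed: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

my $char = chr(0x97);       # one character, code point 0x97
utf8::upgrade($char);       # same character, utf8-flagged internally

my $raw  = bytes_written(':raw',             $char);
my $utf8 = bytes_written(':encoding(UTF-8)', $char);

print "no encoding layer: $raw\n";
print "utf-8 layer:       $utf8\n";
```

Without a layer the character is written as the single byte 97. With the layer it is written as C2 97. That matches the out.xml difference seen when the binmode line is commented out.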

        Anyway, since you appear to be dealing with input that is not really unicode in the first place, you should identify what the true encoding is (probably one of the CP125* sets) and convert it to unicode (see the Encode module) before passing it on to XML::Parser. Probably the easiest way would be a separate script that has nothing to do with XML, but just filters text data, using the Encode module to convert from a non-unicode character set to utf8.
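
A sketch of such a filter, assuming the input really is CP1252 (that is a guess that would need checking against the publishers' data). The sub name `recode_cp1252_to_utf8` is hypothetical. The demo feeds it in-memory handles; in a real filter the two handles would be STDIN and STDOUT.

```perl
use strict;
use warnings;

# Copy everything from $in to $out, decoding the input as CP1252
# and encoding the output as UTF-8 via PerlIO layers.
sub recode_cp1252_to_utf8 {
    my ($in, $out) = @_;
    binmode($in,  ':encoding(cp1252)') or die "binmode (input) failed: $!";
    binmode($out, ':encoding(UTF-8)')  or die "binmode (output) failed: $!";
    while (my $line = <$in>) {
        print {$out} $line;
    }
    return;
}

# Demo: one line containing the raw byte 0x97.
my $source = "<test1>\x97</test1>\n";
my $result = '';
open(my $in,  '<', \$source) or die "open (input) failed: $!";
open(my $out, '>', \$result) or die "open (output) failed: $!";
recode_cp1252_to_utf8($in, $out);
close($out) or die "close failed: $!";

print join(' ', map { sprintf '%02X', ord($_) } split //, $result), "\n";
```

Run as a standalone filter, the call would be `recode_cp1252_to_utf8(\*STDIN, \*STDOUT)`. Note that this only recodes raw bytes: a numeric reference such as &#151; is plain ASCII text and passes through unchanged, so it would still name U0097 afterwards.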

Re: problem with XML::Writer, unicode and Perl 5.6.0 upgrade
by bpphillips (Friar) on Sep 28, 2004 at 20:01 UTC
    Could you post some code that demonstrates the discrepancy you're describing?
      It's nothing fancy but here's a test script:
      #!/usr/local/bin/perl
      use bytes;
      use strict;
      use XML::Parser;
      use XML::Writer;
      use IO::File;

      my $out_xml_file = "out.xml";
      my $print_out_xml_file = "print_out.xml";
      my $xmlFile = "in.xml";

      my $parser = new XML::Parser();
      $parser->setHandlers(
          Start   => \&StartHandler,
          End     => \&EndHandler,
          Comment => \&CommentHandler,
          Char    => \&CharHandler,
          Default => \&DefaultHandler,
      );

      my $print_out_file = new IO::File(">$print_out_xml_file");
      my $out_file = new IO::File(">$out_xml_file");
      binmode($out_file, ":encoding(utf-8)");
      my $write = new XML::Writer(OUTPUT => $out_file, DATA_INDENT => 2);

      print $print_out_file qq(<?xml version="1.0" encoding="UTF-8"?>);
      print $print_out_file "<root_out>\n";
      $write->xmlDecl("UTF-8");
      $write->startTag("root_out");

      $parser->parsefile($xmlFile);

      print $print_out_file "</root_out>";
      $write->endTag("root_out");
      $write->end();
      $out_file->close();
      $print_out_file->close();

      ## ---------------------##
      ## Handlers for XML Parser
      ##
      sub StartHandler {
          my ($p, $el) = @_;
          print "start: $el\n";
          print $print_out_file "<$el>\n";
          $write->startTag($el);
      }

      sub EndHandler {
          my ($p,$el) = @_;
          print "end: $el\n";
          print $print_out_file "</$el>\n";
          $write->endTag($el);
      }

      sub CharHandler {
          my ($p,$chr) = @_;
          print "Char: $chr\n";
          print $print_out_file "$chr";
          $write->characters($chr);
      }

      sub CommentHandler {
          my ($p,$com) = @_;
          print "Comment found: $com\n";
      }

      sub DefaultHandler {
          my($p,$str) = @_;
          print "Default found: $str\n";
      }

      ...and the "in.xml" file (I'm not sure the utf-8 in the test1 element will be correct after copy-n-paste):
      <?xml version="1.0" encoding="UTF-8"?>
      <root>
        <test1>—</test1>
        <test2>&#151;</test2>
      </root>