in reply to Re: problem with XML::Writer, unicode and Perl 5.6.0 upgrade
in thread problem with XML::Writer, unicode and Perl 5.6.0 upgrade

> Is that what you intend/expect it to be?

The data is provided by our users (publishers), so I just try to make whatever is given to us display. I don't pretend to know that much about unicode.

> (If you were expecting it to be some displayable character, then either you
> have the wrong code point in your data, or else you're saying/pretending
> it's unicode when in fact it is not. BTW, I notice that 0x97 is used in the MS
> "CP125*" code pages for "em dash", which is "officially" supposed to
> transliterate into U2014, which in turn should yield a 3-byte utf8 sequence:
> E2 80 94.)

Good point. I think our user did want U2014.

> I tried the test script that you posted in a reply above, and it seemed to put
> a U0097 character -- in utf8 encoding (i.e. as the two-byte sequence
> C2 97) -- for both "test1" and "test2" elements, in all of its outputs (the
> "print_out.xml" file, the "out.xml" file, and STDOUT; of course, I had to use
> a hex dump to actually "see" the character in all cases, since it is not
> displayable). Does that run contrary to your own findings?

Right, sorry for the lack of info here. To see the "<97>" character from XML::Writer, you need to comment out:

binmode($out_file, ":encoding(utf-8)");

I added that line because it is the same change I made to my Writer.pm to get the "correct" character (<C2><97>). I'm starting to think that <C2><97> is not correct, though.

Re^3: problem with XML::Writer, unicode and Perl 5.6.0 upgrade
by graff (Chancellor) on Sep 29, 2004 at 03:20 UTC
    Hmm. When I comment out that "binmode" line as you suggest, I see the difference in the out.xml file: both the "test1" and "test2" elements in that file contain a single byte, 0x97, while print_out.xml and STDOUT contain the utf8-like two-byte sequence C2 97 for both elements. I think what this points out more than anything is Perl 5.8's ambiguous (or perhaps slightly schizoid) treatment of characters in the range 0x80 - 0xff; I still haven't probed all the subtleties involved there.
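A minimal sketch of that difference, assuming in-memory filehandles stand in for the real output files (the variable names and the hex_bytes helper are made up for illustration, and this is not the original test script): the same one-character string comes out as one byte or two depending on whether the handle has an encoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# One character with code point 0x97, as it might arrive from cp1252 data
# that was never decoded.
my $char = "\x97";

# No encoding layer: the character is written as the single byte 0x97.
my $raw_buf = '';
open(my $raw_fh, '>', \$raw_buf) or die "open failed: $!";
print {$raw_fh} $char;
close($raw_fh);

# With a utf-8 encoding layer: Perl treats it as U+0097 and writes C2 97.
my $utf8_buf = '';
open(my $utf8_fh, '>:encoding(UTF-8)', \$utf8_buf) or die "open failed: $!";
print {$utf8_fh} $char;
close($utf8_fh);

print "no layer:    ", hex_bytes($raw_buf),  "\n";
print "utf-8 layer: ", hex_bytes($utf8_buf), "\n";
```

Neither output is an em dash: the first is the raw cp1252 byte, and the second is the utf8 form of the control character U+0097.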

    Anyway, since you appear to be dealing with input that is not really unicode in the first place, you should identify what the true encoding is (probably one of the CP125* sets) and convert it to unicode (see the Encode module) before passing it on to XML::Parser. Probably the easiest way would be a separate script that has nothing to do with XML, but just filters text data, using the Encode module to convert from a non-unicode character set to utf8.
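The conversion step could look something like the sketch below. It assumes the input really is cp1252 (that still has to be confirmed against the actual data), and cp1252_to_utf8 and hex_bytes are hypothetical helper names, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

# Assumption: $bytes holds raw cp1252 data. decode() turns the bytes into
# Perl characters (0x97 becomes U+2014); encode() then produces utf8 bytes.
sub cp1252_to_utf8 {
    my ($bytes) = @_;
    my $chars = decode('cp1252', $bytes);
    return encode('UTF-8', $chars);
}

my $raw  = "a\x97b";                 # "a", cp1252 em dash, "b"
my $utf8 = cp1252_to_utf8($raw);

print hex_bytes($utf8), "\n";        # prints: 61 E2 80 94 62
```

A standalone filter would apply the same function to each line read from STDIN and print the result, with no encoding layer on the output handle, since the data is already encoded.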