Hmm. When I comment out that "binmode" line as you suggest, I do see a difference in the out.xml file: both the "test1" and "test2" elements in that file contain a single byte, 0x97, while print_out.xml and STDOUT contain the UTF-8-like two-byte sequence C2 97 for both elements. I think what this points out more than anything is Perl 5.8's ambiguous (or perhaps slightly schizoid) treatment of characters in the range 0x80 - 0xff; I still haven't probed all the subtleties involved there.
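For what it's worth, here is a minimal sketch of where a C2 97 pair can come from. It assumes the value is held as a plain one-byte string, in which case Perl treats 0x97 as the character U+0097 when encoding to UTF-8. The hex_bytes helper is just a name I made up for display purposes, and this may not be exactly what happens inside your script:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a string as space-separated hex bytes.
sub hex_bytes {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $str;
}

my $raw = "\x97";                   # one byte, no encoding information attached
my $out = encode('UTF-8', $raw);    # treated as the character U+0097

print hex_bytes($raw), "\n";        # 97
print hex_bytes($out), "\n";        # C2 97
```

Note that U+0097 is a C1 control character, not a printable one, which is a good hint that the original byte was never meant as Latin-1 or Unicode.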
Anyway, since you appear to be dealing with input that is not really Unicode in the first place, you should identify what the true encoding is (probably one of the CP125* sets) and convert it to Unicode (see the Encode module) before passing it on to XML::Parser. Probably the easiest way would be a separate script that has nothing to do with XML and just filters text data, using the Encode module to convert from the non-Unicode character set to UTF-8.
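A rough sketch of such a filter follows. It assumes the data is really CP1252, which is only a guess on my part; change $source_encoding if it turns out to be something else. The to_utf8 name and the --filter switch are made up for this example:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# ASSUMPTION: the input is CP1252; substitute the true encoding once known.
my $source_encoding = 'cp1252';

# Decode legacy bytes into Perl characters, then encode them as UTF-8 bytes.
sub to_utf8 {
    my ($bytes) = @_;
    my $chars = decode($source_encoding, $bytes);
    return encode('UTF-8', $chars);
}

# Filter mode (hypothetical switch): raw bytes in on STDIN, UTF-8 out on STDOUT.
if (@ARGV && $ARGV[0] eq '--filter') {
    binmode STDIN;     # no I/O layers; conversion is done explicitly above
    binmode STDOUT;
    while (my $line = <STDIN>) {
        print to_utf8($line);
    }
}
```

Saved under some name of your choosing (say cp2utf8.pl), it could be run as "perl cp2utf8.pl --filter < input.txt > input_utf8.txt", and the output then fed to XML::Parser. Under CP1252, the byte 0x97 is an em dash (U+2014), so it should come out as the three bytes E2 80 94.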