Hmm. When I comment out that "binmode" line as you suggest, I do see the difference in the out.xml file: the "test1" and "test2" elements in that file each contain a single byte, 0x97, while print_out.xml and STDOUT contain the utf8-like two-byte sequence C2 97 for both elements. I think what this points out more than anything is Perl 5.8's ambiguous (or perhaps slightly schizoid) treatment of characters in the range 0x80 - 0xff; I still haven't probed all the subtleties involved there.
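To make the one-byte versus two-byte difference concrete, here is a minimal sketch (not a reproduction of your script) that serializes the same character, code point 0x97, both ways with the Encode module. The helper name hex_bytes is made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $char = chr(0x97);    # a single character with code point 0x97

# Written out as Latin-1, the character is the single byte 97.
print 'latin1: ', hex_bytes(encode('iso-8859-1', $char)), "\n";   # prints "latin1: 97"

# Written out as utf8, the same character becomes the two bytes C2 97.
print 'utf8:   ', hex_bytes(encode('UTF-8', $char)), "\n";        # prints "utf8:   C2 97"
```

So the two files can hold the same character and still differ byte for byte, depending on whether a utf8 layer was applied on output.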
Anyway, since you appear to be dealing with input that is not really unicode in the first place, you should identify what the true encoding is (probably one of the CP125* code pages) and convert it to utf8 (see the Encode module) before passing it on to XML::Parser. Probably the easiest way would be a separate script that has nothing to do with XML, but just filters text data, using the Encode module to convert from the non-unicode character set to utf8.
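A rough sketch of such a filter, assuming the data turns out to be CP1252 (that is only my guess; set $source_encoding to whatever the real encoding is). The sub name to_utf8 and the variable names are made up for the example. In CP1252 the byte 0x97 is an em dash, so it should come out as the three utf8 bytes E2 80 94.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# ASSUMPTION: the data is really CP1252. Change this once you know the true encoding.
my $source_encoding = 'cp1252';

# Turn a string of raw bytes in the source encoding into utf8 bytes.
sub to_utf8 {
    my ($bytes) = @_;
    my $chars = decode($source_encoding, $bytes);   # bytes -> Perl characters
    return encode('UTF-8', $chars);                 # characters -> utf8 bytes
}

# Convert each file named on the command line, writing utf8 to STDOUT.
binmode STDOUT;    # we print already-encoded bytes, so no output layer
for my $path (@ARGV) {
    open my $in, '<:raw', $path or die "Cannot open $path: $!";
    while (my $line = <$in>) {
        print to_utf8($line);
    }
    close $in;
}
```

You would run it as something like "perl to_utf8.pl input.xml > input_utf8.xml" (file names hypothetical) and then hand the converted file to XML::Parser.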