I've been trying to answer this question ever since I started playing with XML. I've got some 8-bit character data. I don't know what character set it's in, and I can't find out. I want to put it in an XML document such that when I read that document later I get the same 8-bit characters in Perl.

At the moment I'm writing the data using XML::Writer, with code like:

    my $writer = XML::Writer->new(OUTPUT => $fh, DATA_MODE => 1, DATA_INDENT => 4);
    $writer->dataElement(foo => $bar);
    $writer->end();

Then, later I try to read it in again using XML::Simple:

    my $data = XMLin($xml, %args);

This blows up when $bar contains bytes that aren't valid UTF-8:

    not well-formed (invalid token) at line 25, column 102, byte 980 at /usr/local/krang/lib/i686-linux/XML/Parser/Expat.pm line 478

What is to be done?

UPDATE: Taking gmpassos's suggestion, I adopted a mechanism similar to XML::Smart. I created a subclass of XML::Writer which automatically Base64-encodes any character content that has illegal characters in it. This content is prefixed with a "!!!BASE64!!!" marker. I then created a subclass of XML::Simple which automatically decodes these sections by looking for the marker.
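
For anyone who wants the gist without the subclassing, here is a minimal sketch of the marker idea. It is not the actual subclass code: the function names (needs_base64, encode_content, decode_content) are made up for illustration, and it assumes "illegal" means either bytes that don't decode as UTF-8 or control characters that XML 1.0 forbids. It also assumes $bar is a plain byte string.

```perl
use strict;
use warnings;
use Encode ();
use MIME::Base64 qw(encode_base64 decode_base64);

my $MARKER = '!!!BASE64!!!';

# Assumption: content needs protecting if it contains control characters
# that XML 1.0 does not allow, or if it is not valid UTF-8.
sub needs_base64 {
    my ($bytes) = @_;
    return 1 if $bytes =~ /[\x00-\x08\x0B\x0C\x0E-\x1F]/;
    my $copy = $bytes;    # decode() may modify its input when a CHECK is given
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 0 : 1;
}

# Hypothetical helper: call on content before handing it to the writer.
sub encode_content {
    my ($bytes) = @_;
    return $bytes unless needs_base64($bytes);
    return $MARKER . encode_base64($bytes, '');    # '' = no line breaks
}

# Hypothetical helper: call on content after the parser returns it.
sub decode_content {
    my ($text) = @_;
    return $text unless index($text, $MARKER) == 0;
    return decode_base64(substr($text, length $MARKER));
}

my $bar  = "caf\xE9 \x80\xFF";        # 8-bit data in an unknown charset
my $safe = encode_content($bar);      # pure ASCII, safe to pass to dataElement()
my $back = decode_content($safe);
print $back eq $bar ? "round trip ok\n" : "round trip FAILED\n";
```

One caveat with this sketch: content that legitimately starts with the marker string would be decoded by mistake, so the marker has to be something your data never begins with.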

It sure isn't pretty, but it sure does work. Maybe someday I'll come up with something more elegant, but until then I'm happy to mark this one FIXED in Bugzilla and move on. Thanks monks!

-sam


In reply to 8-bit Clean XML Data I/O? by samtregar
