The real problem is 'The Unicode'

Most Perl XML modules are built on top of expat, or on XML::Parser, which is an interface to expat. Expat is an XML parser: it takes your XML (XHTML) document and processes its tags and so on. But because XML is fundamentally based on Unicode, expat converts all your characters to Unicode. For this conversion to work properly, you need a valid encoding specified in the XML header: <?xml version='1.0' encoding='iso-8859-2'?> This conversion is the primary reason for the odd characters you encounter. They are the UTF-8 (8-bit Unicode) representation of non-English characters.
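To make the conversion concrete, here is a minimal sketch. It is in Python rather than Perl, only because Python's standard xml.parsers.expat module wraps the same expat library and needs nothing installed; the sample document and variable names are made up for illustration. It assumes the input really is ISO-8859-2, as the header claims.

```python
import xml.parsers.expat

# Hypothetical sample: a Latin-2 document. Byte 0xE8 is "č" in ISO-8859-2.
raw = b"<?xml version='1.0' encoding='iso-8859-2'?><p>\xe8aj</p>"

chunks = []
parser = xml.parsers.expat.ParserCreate()
# expat may deliver character data in several pieces, so collect them all.
parser.CharacterDataHandler = chunks.append
parser.Parse(raw, True)

text = "".join(chunks)       # Unicode string handed back by the parser
utf8 = text.encode("utf-8")  # the bytes you get if that string is written out as UTF-8

print(repr(text))  # one character for "č" ...
print(utf8)        # ... but two bytes in UTF-8, which look odd in a Latin-2 viewer
```

The single input byte 0xE8 comes back as one Unicode character, which takes two bytes once written out as UTF-8. Displaying those two bytes as if they were still ISO-8859-2 is what produces the odd characters.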

You probably want to avoid this conversion. I had a similar problem about a year ago, but found no useful solution. XML::Parser has an original_string method which returns character data in the original encoding, but it won't expand entities. And there is no way to get attributes in the original encoding. The best workaround is to use Unicode::Map8 to map all Unicode strings back to their original encoding, but this is a terribly slow solution for frequent use.
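The "map everything back" workaround can be sketched roughly like this. Again it is Python, not Perl, and it is not Unicode::Map8 itself: str.encode stands in for the mapping step, and parse_to_original and ORIGINAL_ENCODING are hypothetical names. It assumes every character in the document exists in the original charset.

```python
import xml.parsers.expat

ORIGINAL_ENCODING = "iso-8859-2"  # assumption: matches the XML declaration


def parse_to_original(raw):
    """Parse raw XML bytes; return (attributes, text) mapped back to bytes."""
    attrs_out, text_out = [], []

    def start(name, attrs):
        # Attribute values arrive as Unicode; map each one back.
        for key, value in attrs.items():
            attrs_out.append((key, value.encode(ORIGINAL_ENCODING)))

    def chars(data):
        # Character data arrives with entities already expanded.
        text_out.append(data.encode(ORIGINAL_ENCODING))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.Parse(raw, True)
    return attrs_out, b"".join(text_out)


# Hypothetical sample with a Latin-2 byte in an attribute and in the text.
raw = (b"<?xml version='1.0' encoding='iso-8859-2'?>"
       b"<p title='\xe8'>\xe8 &amp; \xe8</p>")
attrs, text = parse_to_original(raw)
print(attrs, text)
```

Unlike original_string, this gives you expanded entities and attributes in the original encoding, at the cost of an extra conversion on every string. A character with no equivalent in the target charset raises UnicodeEncodeError here, so real code would need to decide how to handle that.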

So I wrote my own poor man's XML parser based on Perl patterns. That is not a solution, though, just a hack. If you plan to use XML, you had better move to Unicode completely.

PS: I wonder how XML::Twig implements its keep_encoding option. By forcing expat to behave reasonably, or by converting back to the original charset?


In reply to Re: XML and entities, what am I doing wrong? by gildir
in thread XML and entities, what am I doing wrong? by kevin_i_orourke
