The proper way to include non-XML data (which would otherwise make the document not well-formed, and thus make the parser die) is to escape it.
There are two ways to do this. One is to use entities to replace every '<' and '&'. This is quite easy to generate, but it is a pain to read and makes it hard to get the HTML back as markup. The other way is to use CDATA sections. A CDATA section is a fragment of XML whose content the parser does not treat as markup, so it can include tags, as long as it does not include the end-of-CDATA-section marker (]]>). It makes it quite easy to output the HTML fragment back as markup if you only use CDATA sections for this specific purpose.
Your file would then be:
<item>
<title>test</title>
<description><![CDATA[<b><font color=#"dd0000">temp</a></font></b> try]]></description>
<link>http://www.nowhere.com</link>
</item>
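To make the two escaping options concrete, here is a minimal sketch. It is in Python rather than Perl only so that it runs with nothing but the standard library; the helper names as_entities and as_cdata are made up for this illustration. It escapes the same broken HTML both ways and checks that a parser hands back the original string in either case.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The (deliberately broken) HTML fragment from the example above.
html = '<b><font color=#"dd0000">temp</a></font></b> try'

def as_entities(text):
    # escape() replaces &, < and > with the corresponding entities.
    return escape(text)

def as_cdata(text):
    # A CDATA section cannot contain "]]>", so if the marker shows up in the
    # data it is split across two adjacent CDATA sections.
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

for body in (as_entities(html), as_cdata(html)):
    doc = "<item><description>" + body + "</description></item>"
    item = ET.fromstring(doc)
    # Either way the parser reports the original string as plain text.
    assert item.findtext("description") == html
```

The entity version is unreadable in the file but needs no special care; the CDATA version stays readable, at the cost of having to watch for the ]]> marker in the data.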
Note that this does not mean that the parser completely ignores the section: it just treats its content as non-markup. One often overlooked consequence is that the characters in the section must be in the same encoding as the rest of the document, which is the one given in the encoding declaration of the XML declaration. Without a declaration this is UTF-8 or UTF-16, so if the HTML is likely to be in another encoding you will have to either convert it before including it in the XML document, or have the entire document be in that encoding (and declare it).
Finally, you really should not use XML::Parser directly, but rather a higher-level module based on XML::Parser, a libxml2-based module such as XML::LibXML, or a SAX module, which will let you choose your parser (I must confess I do not know how CDATA sections are supported in SAX2 though).
Oh, and do I really need to mention that XML::Twig has a method that would work quite well in this case? ;--) $elt->remove_cdata turns all CDATA sections in the element into regular markup (actually you cannot access individual elements within the CDATA section, but when you output it, the CDATA markers are skipped and you should get the result you want).
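The encoding point can be checked with a small sketch (again Python with only the standard library; the sample text and the parses helper are assumptions made for the illustration). It takes an HTML fragment stored as ISO-8859-1 bytes and wraps it in a CDATA section three ways: converted to UTF-8, left as-is with a matching declaration, and left as-is with a mismatched declaration.

```python
import xml.etree.ElementTree as ET

# An HTML fragment as it might arrive from a Latin-1 source (assumed sample data).
latin1_html = "<b>caf\u00e9</b>".encode("iso-8859-1")

def wrap(declared_encoding, payload):
    # Build a tiny document whose only content is one CDATA section.
    head = '<?xml version="1.0" encoding="%s"?>\n' % declared_encoding
    return head.encode("ascii") + b"<description><![CDATA[" + payload + b"]]></description>"

# Option 1: convert the fragment to UTF-8 before including it.
utf8_doc = wrap("UTF-8", latin1_html.decode("iso-8859-1").encode("utf-8"))

# Option 2: keep the bytes and declare the whole document as ISO-8859-1.
latin1_doc = wrap("ISO-8859-1", latin1_html)

# Mismatch: Latin-1 bytes inside a document that claims to be UTF-8.
bad_doc = wrap("UTF-8", latin1_html)

def parses(doc):
    # True if the parser accepts the document, False if it reports an error.
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

assert ET.fromstring(utf8_doc).text == "<b>caf\u00e9</b>"
assert ET.fromstring(latin1_doc).text == "<b>caf\u00e9</b>"
assert not parses(bad_doc)  # the CDATA content is still checked against the encoding
```

The last assertion is the overlooked case: wrapping the data in CDATA does not exempt it from the document's encoding.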