in reply to XML::Parser question

The proper way to include non-XML data (which makes the document not well-formed and therefore makes the parser die) is to escape it.

There are 2 ways to do this. One is to use entities to replace every '<' and '&' (with &lt; and &amp;). This is quite easy to generate, but it is a pain to read and makes it harder to get the HTML back as markup. The other way is to use CDATA sections. A CDATA section is a fragment of XML that is pretty much skipped by the parser, so it can include markup, as long as it does not include the end-of-CDATA-section marker ']]>'. If you only use CDATA sections for this specific purpose, it is quite easy to output the HTML fragment back as markup.
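To make the two options concrete, here is a minimal sketch in Python, using only its standard library rather than the Perl modules discussed in this thread. The function names `as_entities` and `as_cdata` are made up for the illustration, and splitting a stray ']]>' across two sections is one common workaround, not something the original post prescribes.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def as_entities(fragment: str) -> str:
    """Escape the fragment with entities (&amp;, &lt;, &gt;)."""
    return escape(fragment)


def as_cdata(fragment: str) -> str:
    """Wrap the fragment in a CDATA section.

    A literal ']]>' would end the section early, so it is split
    across two adjacent CDATA sections.
    """
    safe = fragment.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"


html = "<b>temp</a></b> & try ]]> again"

# Both forms give a well-formed document, and the parser hands back
# the original text in either case.
via_entities = ET.fromstring("<description>" + as_entities(html) + "</description>").text
via_cdata = ET.fromstring("<description>" + as_cdata(html) + "</description>").text

print(as_entities(html))
print(as_cdata(html))
```

The entity form is unreadable as HTML until it is unescaped, while the CDATA form keeps the fragment visible as-is, which is the trade-off described above.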

Your file would then be:

<item>
  <title>test</title>
  <description><![CDATA[<b><font color=#"dd0000">temp</a></font></b> try +]]></description>
  <link>http://www.nowhere.com</link>
</item>
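As a quick check that a parser accepts this file and returns the HTML untouched, here is a small sketch using Python's standard `xml.etree.ElementTree` (chosen only for illustration; any conforming XML parser should behave the same way). The variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

doc = """<item>
  <title>test</title>
  <description><![CDATA[<b><font color=#"dd0000">temp</a></font></b> try +]]></description>
  <link>http://www.nowhere.com</link>
</item>"""

item = ET.fromstring(doc)

# The CDATA markers are gone; the content comes back as plain text,
# including the broken HTML, which the parser never tried to interpret.
html = item.findtext("description")
print(html)
```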

Note that this does not mean that the parser will completely ignore the section, though: it will just treat it as non-markup. One often overlooked consequence is that the characters in the section must be in the same encoding as the rest of the document, which is the encoding declared in the XML declaration. By default this is UTF-8 or UTF-16, so if the HTML is likely to be in another encoding, you will have to either convert it before including it in the XML document, or have the entire document be in that encoding.
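The encoding point can be shown with a short sketch, again in Python's standard library purely as an illustration. It assumes the HTML arrives as ISO-8859-1 bytes and contains no ']]>'; the helper name `to_utf8_document` is hypothetical.

```python
import xml.etree.ElementTree as ET


def to_utf8_document(fragment_bytes: bytes, source_encoding: str) -> bytes:
    """Decode the fragment from its own encoding, then emit a UTF-8 document.

    Assumes the fragment does not contain ']]>'.
    """
    text = fragment_bytes.decode(source_encoding)
    body = "<description><![CDATA[" + text + "]]></description>"
    return ('<?xml version="1.0" encoding="UTF-8"?>\n' + body).encode("utf-8")


latin1_html = "<b>caf\u00e9</b>".encode("iso-8859-1")

# Converted first: the document parses and the accented character survives.
good = to_utf8_document(latin1_html, "iso-8859-1")
recovered = ET.fromstring(good).text

# Pasted in as raw bytes: the CDATA section does not protect it, because
# the byte sequence is not valid in the declared encoding.
bad = (b'<?xml version="1.0" encoding="UTF-8"?>\n<description><![CDATA['
       + latin1_html + b']]></description>')
try:
    ET.fromstring(bad)
    raw_bytes_accepted = True
except ET.ParseError:
    raw_bytes_accepted = False
```

The alternative mentioned above, declaring the whole document as ISO-8859-1 instead of converting the fragment, works equally well as long as every part of the document really is in that encoding.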

Finally, you really should not use XML::Parser, but rather a higher-level module based on XML::Parser, a libxml2-based module such as XML::LibXML, or a SAX module, which will let you choose your parser (I must confess I do not know how CDATA sections are supported in SAX2, though).

Oh, and do I really need to mention that XML::Twig has a method that would work quite well in this case? ;--) $elt->remove_cdata turns all CDATA sections in the element into regular mark-up (actually you cannot access individual elements within the CDATA section, but when you output it it skips the CDATA markers, and you should get the result you want).

Re: Re: XML::Parser question
by primus (Scribe) on Feb 08, 2003 at 22:18 UTC

    thank you monks for the help, the only thing which i suppose i should have stated earlier, is that i do not have control over the formatting of the xml... i am pulling the xml from an outside source, and i kinda get what they give me... i hope i can apply some of this to that. thanks again.

      Oh my! Not again!

      If what you get is really what you describe, then do yourself (and your text-in-pointy-brackets provider) a favor: don't call it XML. And write (or have your provider write) a hundred times "If it does not parse, then it is NOT XML". Whether the reason is messed-up tags, an encoding problem or anything else, they have no business calling it XML if an XML parser doesn't say that it is well-formed.

      Once you have realized this, it then makes sense that, as you are not processing XML, you cannot use XML tools. At least not directly. You first need to convert the data you get into real XML or, even better, have the source provide real XML; make sure it is OK by parsing it, and then you can use an XML module.
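The "make sure it is OK by parsing it" step can be sketched as a gate in front of the real processing. This is a minimal illustration in Python's standard library, not the Perl modules named in this thread; `parse_if_well_formed` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def parse_if_well_formed(data):
    """Return the parsed root element, or None if the data is not XML."""
    try:
        return ET.fromstring(data)
    except ET.ParseError as err:
        # Not well-formed: report why and refuse to go any further.
        print(f"not XML: {err}")
        return None


real_xml = "<item><title>test</title></item>"
not_xml = ('<item><description><b><font color=#"dd0000">temp</a></font></b>'
           "</description></item>")

accepted = parse_if_well_formed(real_xml)
rejected = parse_if_well_formed(not_xml)
```

Only data that passes this check should be handed to an XML module; anything rejected needs to be repaired, or sent back to the provider, first.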