in reply to XML::Twig - literal nodes

This is an XML FAQ: if you want to include unstructured text that can contain anything, even < and & characters, you can use CDATA sections:

<doc>
  <p>regular text here, &lt; needs to be escaped as &amp;lt;</p>
  <literal><![CDATA[here you can use < and & and whatever you want]]></literal>
  <literal><![CDATA[this is how you include the CDATA end mark ]]]]><![CDATA[> by splitting it into 2 different CDATA sections]]></literal>
</doc>
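If the literal text is generated by a program, the end-mark trick can be applied mechanically. The following is only a minimal sketch: `cdata_wrap` is a hypothetical helper name, not something XML::Twig provides, and it assumes the text is meant for element content.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# cdata_wrap is a hypothetical helper (not part of XML::Twig).
# It wraps arbitrary text in a CDATA section and splits every "]]>"
# so that "]]" ends one section and ">" starts the next one.
sub cdata_wrap {
    my ($text) = @_;
    $text =~ s/\]\]>/]]]]><![CDATA[>/g;
    return '<![CDATA[' . $text . ']]>';
}

# Sample text that happens to contain the CDATA end mark.
print cdata_wrap('if (a[b[0]]>c && x<y) { run(); }'), "\n";
```

A parser that reads the result concatenates the adjacent sections, so the application sees the original text again.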

Note that the CDATA section has no effect on the element structure. In fact it is just a convenience that saves you from escaping every single instance of < and & in the text. (CDATA sections are only allowed in element content, so special characters in attribute values still have to be escaped.)

BTW, you will probably want to generate HTML from a document that contains CDATA sections (which would be your next question ;--), and I don't think browsers support them. It is pretty easy: all you have to do is turn them into regular PCDATA and print them; all special characters will then be escaped:

#!/bin/perl -w
use strict;
use XML::Twig;

my $t= XML::Twig->new( );
$t->parse( \*DATA);

# turn every CDATA section into regular PCDATA, so that
# special characters get escaped when the twig is printed
foreach my $cdata ( $t->descendants( '#CDATA'))
  { $cdata->set_pcdata( $cdata->cdata);
    $cdata->set_gi( '#PCDATA');
  }

$t->print;

__DATA__
<doc>
  <p>regular text here, &lt; needs to be escaped as &amp;lt;</p>
  <literal><![CDATA[here you can use < and & and whatever you want]]></literal>
</doc>

updated 2005-05-04: a ]]> was missing from the last CDATA. Thanks to ambrus for pointing this out.

Re: Re: XML::Twig - literal nodes
by John M. Dlugosz (Monsignor) on Nov 08, 2001 at 21:12 UTC
    I'm aware of CDATA, but you misunderstand. Here in PM, we don't have to put our <code> in CDATA sections; rather, special characters can appear directly in them. For example, <code>Foo& r1= x; if (x<y) bar();</code>. No typing of CDATA there... just filtering of the source.

    More formally, when a specified start tag is discovered, check its attributes (because this mechanism is optional) and switch in a source filter or otherwise pre-process the input stream, stopping when the pattern "</$name>" is encountered.

    —John

      Sorry, you can't do this in XML.

      XML::Twig reads XML files, and a file with random &'s and <'s is _not_ XML. Hence XML::Twig or any XML tool can't do a thing for you there. If you want to include random special characters then you _have_ to use one of the 2 appropriate schemes allowed in XML: either escape each instance of those characters or use a CDATA section.

      What you are describing is an interesting format; it is actually an extension of the input format accepted by PerlMonks, but it is not XML. And no tool based on an XML parser can accept it.
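One way around this is to turn the extended format into real XML before any XML parser sees it, using the first of the 2 schemes (escaping). The following is only a rough sketch: `escape_literals` is a hypothetical helper, not an XML::Twig feature, and it assumes the literal element has no attributes, is never nested, and never contains its own end tag in its content.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# escape_literals is a hypothetical pre-processing helper, not part of
# XML::Twig. Assumptions: the element named $name carries no attributes,
# is never nested, and its content never contains its own end tag.
sub escape_literals {
    my ($source, $name) = @_;
    $source =~ s{(<\Q$name\E>)(.*?)(</\Q$name\E>)}{
        my ($open, $body, $close) = ($1, $2, $3);
        $body =~ s/&/&amp;/g;    # first, or the & of &lt; would be escaped again
        $body =~ s/</&lt;/g;
        $body =~ s/>/&gt;/g;
        $open . $body . $close;
    }gse;
    return $source;
}

my $raw = '<doc><code>Foo& r1= x; if (x<y) bar();</code></doc>';
print escape_literals($raw, 'code'), "\n";
```

The output of such a filter is ordinary XML, so it could then be handed to XML::Twig like any other document. The filter itself still works on plain text, outside the XML parser.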

      Darn! You've reached the limit beyond which I can't extend XML::Twig. I can't believe it!

        Picture this:

        The primitive parser (Expat in your case) reads, conceptually, a character at a time and runs it through the grammar rules.

        When I get the open-tag event, the code can flip a flag in the thing that the parser is reading from, so that filtering is turned on. The filter turns itself off when it sees the end tag.

        So, it requires knowledge of the exact input position of the parser, so that logically it reads one char at a time (if it buffers lines, it needs to flush & resync to the "logical" position).

        So I think XML::Twig could handle it, using that method: input filters that are specified as processing input exactly up to the tag that triggered the current callback. So, if the callback changes something in the filter, it will affect everything following that tag in the source.

        See what I mean?

        —John