XML::XPath and preserving CDATA fields

mfriedman has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks:

I am currently using XML::XPath to extract and change some nodes in an XML document. I use code similar to the following:

my $xpath = XML::XPath->new('xml' => $XML);
foreach my $node ($xpath->findnodes('/foo/bar/element')->get_nodelist) {
    my $data = my_munge_function($node->as_string);
    my $id = $node->getAttribute('id');
    # quote the id so the predicate compares against a string literal
    $xpath->setNodeText('/foo/bar/element[@id="' . $id . '"]', $data);
}

I am then retrieving the modified XML thusly:

my $newXML = $xpath->findnodes_as_string('/');

NOTE: I'm not sure if all the syntax above is exactly correct - I just wrote it out from memory, since the actual code is not in front of me right now.

This all works fine, except for one thing. When I retrieve the modified XML from the $xpath object, the CDATA fields surrounding certain data have disappeared, and XML::XPath has escaped some characters that should not have been escaped. As an example, the XML

<foo>
  <bar>
    <text><![CDATA[La dee da de da.<br>Foo bar baz]]></text>
  </bar>
</foo>

Comes back as:

<foo>
  <bar>
    <text>La dee da de da.&amp;lt;br>Foo bar baz</text>
  </bar>
</foo>

I think I know why the CDATA field disappears: the parser returns the content of the field to XML::XPath, but not the information that it was a CDATA section. The problem is that this XML must then go to an XSL transformer (Sablotron, in this case), and the escaped output breaks that step.

I would appreciate dearly any insight into this matter.

Thanks,

-Mike


Replies are listed 'Best First'.
Re: XML::XPath and preserving CDATA fields by mirod (Canon) on May 30, 2002 at 11:28 UTC
Getting `&amp;lt;br>` instead of `&lt;br>` is certainly a bug, but using the latest version (1.12) I get the right output, so if you are using an earlier version you might want to upgrade.

As for XML::XPath turning the CDATA section into regular PCDATA with entities for & and <, I would think this is a design choice: the fact that there ever was a CDATA section seems to be totally ignored by XML::XPath. For XML purposes the 2 versions are equivalent; CDATA sections are just a shortcut to avoid typing a bunch of entities.

That said, I know that for a lot of applications there is a difference between the 2, especially when you want to include regular HTML within an XML document, and I don't really like modules that don't preserve the original form of the input document. But hey, XML::XPath is so convenient that it might be worth using it and writing an extra step that restores the CDATA section, so here is my solution:

#!/usr/bin/perl -w
use strict;
use XML::XPath;

undef $/;
my $XML = <DATA>;

my $xpath = XML::XPath->new('xml' => $XML);
foreach my $node ($xpath->findnodes('/foo/bar/text')->get_nodelist) {
    my $data = your_munge_function($node->string_value);
    my $id = $node->getAttribute('id');
    $xpath->setNodeText('/foo/bar/text[@id="' . $id . '"]', $data);
}

my $newXML = $xpath->findnodes_as_string('/');

# safe because XML::XPath entiti-zes > in attributes
# (the s flag lets the content of a text element span several lines)
$newXML =~ s{(<text[^>]*>)(.*?)(</text>)}
            {$1 . cdata_ize($2) . $3}egs;

print $newXML;

sub your_munge_function { return "munged $_[0]"; }

sub cdata_ize {
    my $text = shift;
    $text =~ s{&lt;}{<}g;
    $text =~ s{&amp;}{&}g;   # last, so an escaped "&lt;" in the data is not turned into "<"
    return "<![CDATA[$text]]>";
}

__DATA__
<foo>
  <bar>
    <text id="text1>"><![CDATA[La dee da de da.<br>Foo bar baz]]></text>
  </bar>
</foo>
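To make the equivalence claim and the double-escaping symptom concrete, here is a minimal sketch in core Perl only. The helpers escape_text and unescape_text are hypothetical names invented for this illustration; they are not part of XML::XPath and only handle & and <, the two characters that must be escaped in XML character data.

```perl
use strict;
use warnings;

# Hypothetical helper (not an XML::XPath function): escape character data.
sub escape_text {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;   # ampersand first, so the entities added next are not re-escaped
    $text =~ s/</&lt;/g;
    return $text;
}

# Hypothetical helper: reverse of escape_text.
sub unescape_text {
    my ($text) = @_;
    $text =~ s/&lt;/</g;
    $text =~ s/&amp;/&/g;   # ampersand last, so "&amp;lt;" becomes "&lt;" and not "<"
    return $text;
}

my $raw   = 'La dee da de da.<br>Foo bar baz';   # what the CDATA section holds
my $once  = escape_text($raw);                   # the equivalent entity form
my $twice = escape_text($once);                  # escaped a second time

print "$once\n";
print "$twice\n";
print unescape_text($once) eq $raw ? "round trip ok\n" : "round trip failed\n";
```

Escaping once gives the entity form that an XML parser reads back as the original text; escaping twice gives the `&amp;lt;br>` form shown in the question, which a parser reads back as the literal text `&lt;br>`.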
Re: Re: XML::XPath and preserving CDATA fields by Anonymous Monk on May 30, 2002 at 15:11 UTC
Hi mirod, Thanks a lot for your suggestion. I think I am going to end up doing something similar to that. I was hoping there would be a more elegant solution than manually fixing the CDATA fields, though. :)
Re: Re: Re: XML::XPath and preserving CDATA fields by mirod (Canon) on May 30, 2002 at 16:07 UTC
OK, I know everybody was waiting for me to use my hammer ;--) ... here is a solution using XML::Twig.

One big caveat though is that XML::Twig's version of XPath is way, way, _WAY_ less powerful than what XML::XPath offers. No functions except string, complex sub expressions not supported, you name it. It does `/foo/bar/text/` though ;--)

#!/usr/bin/perl -w
use strict;
use XML::Twig;

my $twig = XML::Twig->new( pretty_print => 'indented');
$twig->parse( \*DATA);   # pass the DATA filehandle as a glob reference

# the * means that the nodes returned will be either #PCDATA
# or #CDATA, this would not work if the content of text was...
# not text but included sub elements
foreach my $node ($twig->find_nodes('/foo/bar/text/*')) {
    my $data = your_munge_function($node->text);
    $node->set_text( $data);
}

$twig->print;

sub your_munge_function { return "munged $_[0]"; }

__DATA__
<foo>
  <bar>
    <text id="text1>"><![CDATA[La dee da de da.<br>Foo bar baz]]></text>
    <text id="text2>">a normal text></text>
  </bar>
</foo>
Re: Re: Re: Re: XML::XPath and preserving CDATA fields by mfriedman (Monk) on May 30, 2002 at 19:36 UTC
Re: XML::XPath and preserving CDATA fields by Matts (Deacon) on May 30, 2002 at 15:14 UTC
This is a design decision. I wanted to base the module as closely as possible on the XPath specification (this makes implementation easier, and believe me, writing that module was no easy task). If you look at the section on text nodes in the XPath spec, you'll see that CDATA sections are treated exactly like ordinary text nodes. In order to be able to merge adjacent text nodes, the concept of CDATA nodes had to go.

I know that probably doesn't give you the answer you wanted. If you really need CDATA information in the output stream, I suggest you look at something with a more feature-rich API, like XML::LibXML, SAX, or XML::Parser.
Re: Re: XML::XPath and preserving CDATA fields by IOrdy (Friar) on May 31, 2002 at 04:57 UTC
Matts++ (like he needs any more XP, but it's the thought that counts). I asked this same question on the mailing list a while ago :)

If you still like the thought of XPath over DOM (as I did/do) but would like to preserve/edit/add... CDATA, then I found XML::LibXML a great alternative to XML::XPath. (Plus it works well with XML::LibXSLT, which I believe is faster than Sablotron.)

<edit> added some code </edit>

#!/usr/bin/perl -w
use strict;
use XML::LibXML;

undef $/;
my $XML = <DATA>;

my $parser = XML::LibXML->new();
my $xmlp = $parser->parse_string($XML);

print $xmlp->toString; # before munge

foreach my $node ($xmlp->findnodes('/foo/bar/text')) {
    my $data = your_munge_function($node->findvalue('text()'));
    # get_node is 1-based; setData keeps the node's type (CDATA stays CDATA)
    $node->findnodes('text()')->get_node(1)->setData($data);
}

print $xmlp->toString; # after munge

# functions
sub your_munge_function { return "munged $_[0]"; }

# data
__DATA__
<foo>
  <bar>
    <text id="text1"><![CDATA[La dee da de da.<br>Foo bar baz]]></text>
    <text id="text2">normal text</text>
  </bar>
</foo>
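Building on the XML::LibXML approach above, here is a hedged sketch of the opposite case: the text is not yet in a CDATA section and you want it written out as one. The sub name wrap_text_in_cdata and the sample document are assumptions made for this illustration. It relies on XML::LibXML keeping CDATA sections as their own node type when parsing with default options, and it does not handle text that itself contains `]]>`.

```perl
use strict;
use warnings;
use XML::LibXML;   # exports the XML_*_NODE node type constants by default

# Hypothetical helper: replace plain text children of the matching
# elements with CDATA sections. Existing CDATA children are left alone.
sub wrap_text_in_cdata {
    my ($doc, $xpath) = @_;
    foreach my $elem ($doc->findnodes($xpath)) {
        foreach my $child ($elem->childNodes) {
            # CDATA sections report XML_CDATA_SECTION_NODE, so they are skipped here
            next unless $child->nodeType == XML_TEXT_NODE;
            my $cdata = $doc->createCDATASection($child->data);
            $elem->replaceChild($cdata, $child);
        }
    }
    return $doc;
}

my $doc = XML::LibXML->load_xml(string => <<'XML');
<foo>
  <bar>
    <text id="text1"><![CDATA[La dee da de da.<br>Foo bar baz]]></text>
    <text id="text2">normal &lt;b&gt; text</text>
  </bar>
</foo>
XML

wrap_text_in_cdata($doc, '/foo/bar/text');
print $doc->toString;
```

The node type check is what XML::XPath cannot offer, since its data model merges CDATA into ordinary text; here the serializer writes each node back in the form its type dictates.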
Re: XML::XPath and preserving CDATA fields by hackmare (Pilgrim) on May 30, 2002 at 14:16 UTC
Mike, This is a most indirect solution, and I apologise for it. But I think that Peter Wainwright had the same issue in his SVG::Parser module on CPAN and I am fairly certain that he solved it. So take a look at SVG::Parser (V0.97, I think) and see how he handles it. I hope I'm not wrong, hackmare.

