XML::XPath and preserving CDATA fields

by mfriedman (Monk)
on May 30, 2002 at 02:45 UTC ( [id://170286]=perlquestion )

mfriedman has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed Monks:

I am currently using XML::XPath to extract and change some nodes in an XML document. I use code similar to the following:

my $xpath = XML::XPath->new('xml' => $XML);
foreach my $node ($xpath->findnodes('/foo/bar/element')->get_nodelist) {
    my $data = my_munge_function($node->as_string);
    my $id = $node->getAttribute('id');
    $xpath->setNodeText('/foo/bar/element[@id=' . $id . ']', $data);
}

I am then retrieving the modified XML thusly:

my $newXML = $xpath->findnodes_as_string('/');

NOTE: I'm not sure if all the syntax above is exactly correct - I just wrote it out from memory, since the actual code is not in front of me right now.

This all works fine, except for one thing. When I retrieve the modified XML from the $xpath object, the CDATA fields surrounding certain data have disappeared, and XML::XPath has escaped some characters that shouldn't have been escaped. As an example, the XML

<foo>
  <bar>
    <text><![CDATA[La dee da de da.<br>Foo bar baz]]></text>
  </bar>
</foo>

Comes back as:

<foo>
  <bar>
    <text>La dee da de da.&amp;lt;br>Foo bar baz</text>
  </bar>
</foo>

I think I know why the CDATA field disappears: the parser returns the content of the field to XML::XPath but not the information that it was a CDATA section. The problem is that this XML must then go to an XSL transformer (Sablotron, in this case), and the result is broken.

I would appreciate dearly any insight into this matter.

Thanks,

-Mike

Replies are listed 'Best First'.
Re: XML::XPath and preserving CDATA fields
by mirod (Canon) on May 30, 2002 at 11:28 UTC

    Getting &amp;lt;br> instead of &lt;br> is certainly a bug, but with the latest version (1.12) I get the right output, so if you are using an earlier version you might want to upgrade.

    As for XML::XPath turning the CDATA section into regular PCDATA with entities for & and <, I would think this is a design choice: the fact that there ever was a CDATA section seems to be totally ignored by XML::XPath. For XML purposes the 2 versions are equivalent; CDATA sections are just a shortcut to avoid typing a bunch of entities. That said, I know that for a lot of applications there is a difference between the 2, especially when you want to include regular HTML within an XML document, and I don't really like modules that don't preserve the original form of the input document. But hey, XML::XPath is so convenient that it might be worth using it and writing an extra step that restores the CDATA section, so here is my solution:

    #!/usr/bin/perl -w
    use strict;
    use XML::XPath;

    undef $/;                 # slurp the whole DATA section at once
    my $XML=<DATA>;

    my $xpath = XML::XPath->new('xml' => $XML);
    foreach my $node ($xpath->findnodes('/foo/bar/text')->get_nodelist) {
        my $data = your_munge_function($node->string_value);
        my $id = $node->getAttribute('id');
        $xpath->setNodeText('/foo/bar/text[@id="' . $id . '"]', $data);
    }
    my $newXML = $xpath->findnodes_as_string('/');

    # safe because XML::XPath entiti-zes > in attributes
    $newXML=~ s{(<text[^>]*>)(.*?)(</text>)}
               {$1 . cdata_ize($2) . $3}eg;
    print $newXML;

    sub your_munge_function {
        return "munged $_[0]";
    }

    sub cdata_ize {
        my $text= shift;
        # decode &amp; last, so an escaped "&amp;lt;" ends up as the
        # literal text "&lt;" instead of being decoded twice
        $text=~ s{&lt;}{<}g;
        $text=~ s{&amp;}{&}g;
        return "<![CDATA[$text]]>";
    }

    __DATA__
    <foo>
      <bar>
        <text id="text1>"><![CDATA[La dee da de da.<br>Foo bar baz]]></text>
      </bar>
    </foo>

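    The unescape-and-wrap step is where this kind of post-processing usually goes wrong, so here is a minimal standalone sketch of it, using core Perl only. The helper names unescape_text and cdata_wrap are hypothetical (they are not part of XML::XPath), and the sketch assumes the serializer only emits the &lt;, &gt; and &amp; entities in text content.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: undo the entity escapes found in serialized text.
# '&amp;' is decoded last, so '&amp;lt;' becomes the literal text '&lt;'
# rather than being decoded a second time into '<'.
sub unescape_text {
    my ($text) = @_;
    $text =~ s/&lt;/</g;
    $text =~ s/&gt;/>/g;
    $text =~ s/&amp;/&/g;
    return $text;
}

# Hypothetical helper: wrap raw text in a CDATA section. The sequence
# ']]>' may not appear inside CDATA, so it is split across two
# adjacent sections.
sub cdata_wrap {
    my ($raw) = @_;
    $raw =~ s/\]\]>/]]]]><![CDATA[>/g;
    return "<![CDATA[$raw]]>";
}

my $escaped = 'La dee da.&lt;br>Tom &amp; Jerry, a literal &amp;lt;';
my $wrapped = cdata_wrap(unescape_text($escaped));
print "$wrapped\n";
```

    Decoding &amp; last and splitting ']]>' are the two details that a plain regex pass tends to miss.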
      Hi mirod,

      Thanks a lot for your suggestion. I think I am going to end up doing something similar to that. I was hoping there would be a more elegant solution than manually fixing the CDATA fields, though. :)

        OK, I know everybody was waiting for me to use my hammer ;--) ... here is a solution using XML::Twig. One big caveat though is that XML::Twig's version of XPath is way, way, _WAY_ less powerful than what XML::XPath offers. No functions except string, complex sub expressions not supported, you name it. It does /foo/bar/text/* though ;--)

        #!/usr/bin/perl -w
        use strict;
        use XML::Twig;

        my $twig = XML::Twig->new( pretty_print => 'indented');
        $twig->parse( \*DATA);

        # the * means that the nodes returned will be either #PCDATA
        # or #CDATA, this would not work if the content of text was...
        # not text but included sub elements
        foreach my $node ($twig->find_nodes('/foo/bar/text/*')) {
            my $data = your_munge_function($node->text);
            $node->set_text( $data);
        }
        $twig->print;

        sub your_munge_function {
            return "munged $_[0]";
        }

        __DATA__
        <foo>
          <bar>
            <text id="text1>"><![CDATA[La dee da de da.<br>Foo bar baz]]></text>
            <text id="text2>">a normal text></text>
          </bar>
        </foo>

Re: XML::XPath and preserving CDATA fields
by Matts (Deacon) on May 30, 2002 at 15:14 UTC
    This is a design decision. I wanted to base the module as closely as possible on the XPath specification (as this makes implementation easier - and believe me it was no easy task writing that module). If you look at the section on text nodes in the XPath spec, you'll see that CDATA sections are treated exactly like non-CDATA text nodes. In order to be able to merge adjacent text nodes, the concept of CDATA nodes had to go.

    I know that probably doesn't give you the answer you wanted. If you *really* need CDATA information in the output stream, I suggest you look at something with a more feature rich API like XML::LibXML, SAX, or XML::Parser.
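    To illustrate the distinction, here is a small sketch using XML::LibXML (one of the modules suggested above). It assumes XML::LibXML is installed; the sample document and variable names are made up for the example. The XPath string-value is the same for the CDATA form and the escaped form, while the DOM node type still records which one was used.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML qw(:libxml);   # imports the node type constants

# Sample document (an assumption for this sketch): the same text,
# once as a CDATA section and once with an entity.
my $doc = XML::LibXML->new->parse_string(
    '<foo><a><![CDATA[x <br> y]]></a><b>x &lt;br> y</b></foo>'
);

# In the XPath data model both elements have the same string-value.
my $from_cdata   = $doc->findvalue('/foo/a');
my $from_escaped = $doc->findvalue('/foo/b');
print "same string-value\n" if $from_cdata eq $from_escaped;

# The DOM still remembers how each piece of text was written.
my ($a_child) = $doc->findnodes('/foo/a')->get_node(1)->childNodes;
my ($b_child) = $doc->findnodes('/foo/b')->get_node(1)->childNodes;
print "a holds a CDATA section\n"
    if $a_child->nodeType == XML_CDATA_SECTION_NODE;
print "b holds a plain text node\n"
    if $b_child->nodeType == XML_TEXT_NODE;
```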

      Matts++ (like he needs any more xp but it's the thought that counts)

      I asked this same question on the mailing list a while ago :)
      If you still like the thought of XPath over DOM (as I did/do) but would like to preserve/edit/add... CDATA, then I found XML::LibXML a great alternative to XML::XPath. (Plus it works well with XML::LibXSLT, which I believe is faster than Sablotron.)

      <edit> added some code </edit>
      #!/usr/bin/perl -w
      use strict;
      use XML::LibXML;

      undef $/;
      my $XML=<DATA>;

      my $parser = XML::LibXML->new();
      my $xmlp = $parser->parse_string($XML);
      print $xmlp->toString;   #before munge

      foreach my $node ($xmlp->findnodes('/foo/bar/text')) {
          my $data = your_munge_function($node->findvalue('text()'));
          $node->findnodes('text()')->get_node(1)->setData($data);
      }
      print $xmlp->toString;   #after munge

      # functions
      sub your_munge_function {
          return "munged $_[0]";
      }

      # data
      __DATA__
      <foo>
        <bar>
          <text id="text1"><![CDATA[La dee da de da.<br>Foo bar baz]]></text>
          <text id="text2">normal text</text>
        </bar>
      </foo>

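      The code above covers preserving and editing; for the "add" case, a minimal sketch (again assuming XML::LibXML is installed, with a made-up document and id value) could create the CDATA node explicitly with createCDATASection:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Sample document (an assumption for this sketch).
my $doc = XML::LibXML->new->parse_string(
    '<foo><bar><text id="text1">normal text</text></bar></foo>'
);

# Build a new <text> element whose only child is a CDATA section,
# then append it under /foo/bar.
my ($bar) = $doc->findnodes('/foo/bar');
my $new = $doc->createElement('text');
$new->setAttribute('id', 'text2');
$new->appendChild($doc->createCDATASection('La dee da.<br>Foo & bar'));
$bar->appendChild($new);

# Serialize only the root element (no XML declaration).
my $out = $doc->documentElement->toString;
print "$out\n";
```

      Because the node is a CDATA section in the DOM, the serializer writes the markup characters through unescaped.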
Re: XML::XPath and preserving CDATA fields
by hackmare (Pilgrim) on May 30, 2002 at 14:16 UTC
    Mike, This is a most indirect solution, and I apologise for it. But I think that Peter Wainwright had the same issue in his SVG::Parser module on CPAN and I am fairly certain that he solved it. So take a look at SVG::Parser (V0.97, I think) and see how he handles it.
    I hope I'm not wrong,

    hackmare.
