shamu has asked for the wisdom of the Perl Monks concerning the following question:

I'm using XML::LibXML 1.66 and libxml2 2.6.26 to parse some XML, and I would like to print the XML exactly as it is read, without modification. When I run the following, all the entities are expanded, which is not what I want. Does anyone know how to correct this?

#!/usr/bin/perl
use strict;
use XML::LibXML;

my $fh = *DATA;
my $parser = XML::LibXML->new();
$parser->expand_entities(0);
my $doc = $parser->parse_fh( $fh );
my $root = $doc->getDocumentElement;
my $format = 1;
my $docencoding = 1;

foreach my $xform_node ($root->findnodes('jobs/job')) {
    next if($xform_node->nodeType != &XML_ELEMENT_NODE);
    my $path = $xform_node->findvalue('info/directory');
    my $xmlstring = $xform_node->toString($format,$docencoding);
    print "$path\n";
    print "$xmlstring\n";
}

__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<repository>
  <jobs>
    <job>
      <name>Screening</name>
      <directory>&#47;Biometrics&#47;TestCase</directory>
      <created_user>admin</created_user>
      <created_date>2007&#47;11&#47;29 15:29:07.000</created_date>
      <modified_user>admin</modified_user>
      <modified_date>2008&#47;02&#47;11 15:58:28.000</modified_date>
    </job>
  </jobs>
</repository>

Re: XML::LibXML expand_entities always expands entities
by pc88mxer (Vicar) on May 14, 2008 at 16:26 UTC
    The entities that this option refers to are not numeric character references (like &#47;) but entities defined by <!ENTITY ...> declarations. See this page for more info.

    Here's an example of how it works:

    #!/usr/bin/perl
    use strict;
    use XML::LibXML;

    my $fh = *DATA;
    my $parser = XML::LibXML->new();
    $parser->expand_entities($ARGV[0]);
    my $doc = $parser->parse_fh( $fh );
    my $root = $doc->getDocumentElement;
    print $root->toString(1, 1), "\n";

    __END__
    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE author [
      <!ELEMENT author (#PCDATA)>
      <!ENTITY js "Jo Smith">
    ]>
    <author>&js;</author>
    If called with a true argument, the output will be:
    <author>Jo Smith</author>
    and if called with a false argument, the output is:
    <author>&js;</author>
      How can I prevent standard entities (e.g. &#47;) from being decoded?
    Re: XML::LibXML expand_entities always expands entities
    by ikegami (Patriarch) on May 14, 2008 at 17:15 UTC
      May I ask why? "&#47;" and the character it decodes to are completely equivalent in XML. There might not even be a way since the parser may not distinguish between the two.
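To illustrate the equivalence described above, here is a minimal sketch, assuming XML::LibXML is installed. It parses the same element once with &#47; and once with a literal slash, then compares the text the parser hands back. The variable names are made up for the illustration. With expand_entities(0) set, the two strings should still compare equal, because numeric character references are resolved during parsing rather than kept as entity nodes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

my $parser = XML::LibXML->new();
$parser->expand_entities(0);    # affects <!ENTITY ...> entities, not &#47;

# Same content, written two ways in the source text.
my $with_ref = $parser->parse_string(
    '<directory>&#47;Biometrics&#47;TestCase</directory>');
my $literal = $parser->parse_string(
    '<directory>/Biometrics/TestCase</directory>');

# Text content as seen after parsing.
my $text_ref = $with_ref->documentElement->textContent;
my $text_lit = $literal->documentElement->textContent;

print $text_ref eq $text_lit ? "same text\n" : "different text\n";
```

Once the document is in the tree, nothing records which spelling the source used, so toString() has no way to reproduce &#47;.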
        I'm reading a file and I want the output to match it exactly; the contents should be unmodified. I'd like to perform a diff on the source and destination, and they don't match if one has '&#47;' and the other has '/'.

          This may or may not be any help but I do something somewhat related. I decode all the safe entities in HTML before parsing it with XML::LibXML. Along these lines-

          use HTML::Entities;
          our %Charmap = %HTML::Entities::entity2char;
          delete @Charmap{qw( amp lt gt quot apos )};
          HTML::Entities::_decode_entities($html, \%Charmap);

          You would then have something closer up front for comparing. Maybe. They're both processed data but at least you'd know they processed the same.
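For the diff use case specifically, the same idea can be sketched in core Perl without HTML::Entities: normalize the source text before diffing, so that both sides spell "/" the same way. This is a plain regex substitution swapped in for the HTML::Entities approach above, not the poster's code. The function name decode_safe_char_refs is hypothetical, and the sketch assumes the input contains no CDATA sections or comments where &#...; should be left untouched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Code points that must stay escaped in XML text: & < > " '
my %keep = map { $_ => 1 } (38, 60, 62, 34, 39);

# Hypothetical helper: decode numeric character references
# (&#47; or &#x2F;) except those for markup-significant characters.
sub decode_safe_char_refs {
    my ($xml) = @_;
    $xml =~ s{(&#(?:x([0-9A-Fa-f]+)|([0-9]+));)}{
        my $code = defined $2 ? hex($2) : $3 + 0;
        $keep{$code} ? $1 : chr($code);
    }ge;
    return $xml;
}

my $source = '<directory>&#47;Biometrics&#47;TestCase</directory>'
           . '<note>a &#38; b</note>';

print decode_safe_char_refs($source), "\n";
```

Running the original file through a filter like this and diffing the result against the toString() output compares like with like, at the cost of no longer diffing against the untouched source.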