RenalPete has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I'm using XML::DOM to parse, modify, then save an XML document. My problem is that when I save a document which contained character references/entities, they are written out as multibyte characters without being re-encoded as references. XML::DOM then refuses to parse its own output. Please forgive me if I have the terminology wrong; here is an example of what happens:

Input:
<?xml version="1.0"?>
<DocumentRoot>
 <Element Attr="B&#xE4;r" />
</DocumentRoot>
Script:
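use XML::DOM;

my $file = $ARGV[0];    # first command-line argument: the XML file to read
my $parser = XML::DOM::Parser->new();
my $doc;

# parsefile() dies on malformed input, so trap the error
eval { $doc = $parser->parsefile($file); };
if ($@) {
    die "parsefile() failed: $@\n";
}

# write the document back out next to the input file
$doc->printToFile($file . "_out");
exit;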
The output file created has the extended character (a umlaut) written un-encoded. I'm not sure if this will display properly:
<?xml version="1.0"?>
<DocumentRoot>
 <Element Attr="Bär"/>
</DocumentRoot>
If I then pass this output back to the script, I get:
parsefile() failed: not well-formed (invalid token) at line 3, column 17, byte 54 at /usr/lib/perl5/XML/Parser.pm line 187
I've had a look in the XML::DOM code, and I reckon that encodeText() would be the place to do it. However, it appears to take a list of the characters which should be encoded, and for Unicode this would be a pretty big list :-) It's quite possible that there's something about hidden nodes which could be relevant. Can anyone point me in the right direction?

I guess if all else fails, a bit of dirty regex-ery on the output could work :-P
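
Something like this is what I have in mind for the regex fallback. It is only a rough sketch: escape_non_ascii is a name I made up (it is not part of XML::DOM), and it assumes the serialized XML is already in hand as a decoded character string rather than raw bytes, otherwise each byte of a multibyte character would be escaped separately.

```perl
use strict;
use warnings;

# Hypothetical helper (not an XML::DOM function): replace every character
# above 0x7F with a hexadecimal character reference such as &#xE4;.
# Assumes $text holds decoded characters, not UTF-8 encoded bytes.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%X;', ord($1))/ge;
    return $text;
}

# Example: the attribute from the output file above
my $xml = qq{<Element Attr="B\x{E4}r"/>};
print escape_non_ascii($xml), "\n";
```

This would only be safe where character references are legal (text and attribute values). A non-ASCII character in an element name, a comment or a CDATA section would be mangled, hence "dirty".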


Thanks!

Re: XML::DOM not re-encoding character references of unicode characters?
by Jenda (Abbot) on Nov 19, 2007 at 15:56 UTC