papo has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to parse and modify some XML files using XML::DOM(::XPath). These XML files are encoded in UTF-8 and provide a correct XML declaration:
<?xml version = "1.0" encoding = "UTF-8"?>
If I save or print the modified XML, that declaration is still intact, but the actual file encoding changes to Latin-1. This happens even if I comment out all of the modifications and just parse the file. That reduces my code to:
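#!/usr/bin/perl
use strict;
use warnings;
use XML::DOM::XPath;

my $parser = XML::DOM::Parser->new();
my $srcdir = shift;
my $dstdir = shift;

# Copy every .xml file from the source to the destination directory,
# parsing and re-serializing it on the way.
opendir(SRC, $srcdir) or die "Cannot open $srcdir: $!";
foreach (grep { /\.xml$/ } readdir(SRC)) {
    my $doc = $parser->parsefile("$srcdir/$_");
    $doc->printToFile("$dstdir/$_");
    $doc->dispose;
}
closedir(SRC);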
My knowledge of Perl is rather basic, so I must be doing something utterly wrong here, but unfortunately I can't figure out what it is. I'm using Perl 5.8.8 and XML::DOM 1.43.

Any hints are appreciated; thank you in advance.

Re: XML::DOM encodes in latin1 instead of UTF-8
by moritz (Cardinal) on Jun 18, 2008 at 21:46 UTC
    There are two bug reports describing the same behaviour: 27793 and 14579 - one about a year old, the other two years old.

    To quote one review:

    XML::DOM is also quite old, and not actively supported, although the maintainer must be thanked for taking care of a module that he does not really use.

    The last release happened two years ago.

Re: XML::DOM encodes in latin1 instead of UTF-8
by pc88mxer (Vicar) on Jun 18, 2008 at 21:46 UTC
    I think this is a bug in XML::DOM. A work-around is to print to a string and then output it yourself with the correct I/O layers in place:
    my $doc = $parser->parsefile("$srcdir/$_");
    # Serialize the document (not the parser) to a string.
    my $output = $doc->toString;
    # Write it through a UTF-8 output layer.
    open(my $out, ">:utf8", "$dstdir/$_") or die "...";
    print $out $output;
    close($out);
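
To see why the I/O layer matters, here is a minimal sketch using only core Perl. It writes the same character string to two in-memory files, once with no layer and once through an :encoding(UTF-8) layer, and dumps the resulting bytes. The helper name bytes_written and the sample string are made up for the illustration; nothing here touches XML::DOM itself.

```perl
use strict;
use warnings;

# A character string with one non-ASCII character (U+00E9, e with acute accent).
my $text = "caf\x{e9}";

# Hypothetical helper: print $string to an in-memory file opened with the
# given layer and return the raw bytes that ended up in the buffer.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer) or die "Cannot open in-memory file: $!";
    print {$fh} $string;
    close($fh) or die "Cannot close in-memory file: $!";
    return $buffer;
}

# Render a byte string as space-separated hex pairs.
sub hex_dump {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $raw  = bytes_written('',                 $text);  # no layer
my $utf8 = bytes_written(':encoding(UTF-8)', $text);  # explicit UTF-8 layer

print "no layer:    ", hex_dump($raw),  "\n";  # 63 61 66 e9    (Latin-1 byte)
print "UTF-8 layer: ", hex_dump($utf8), "\n";  # 63 61 66 c3 a9 (UTF-8 bytes)
```

Without a layer, a character below U+0100 goes out as a single byte, which is what a Latin-1 file looks like. With the layer in place, the same character is written as the two-byte UTF-8 sequence.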
Re: XML::DOM encodes in latin1 instead of UTF-8
by Your Mother (Archbishop) on Jun 18, 2008 at 22:00 UTC

    Give this a spin. I was kind of screwing around with the Path::Class stuff just to have fun (not exactly how I'd do a production script -- still this has better error checking than the original and XML::LibXML is really quite nice).

    use strict;
    use warnings;
    use XML::LibXML;
    use Path::Class;

    my $parser = XML::LibXML->new();
    my $srcdir = shift || die "Forgot to give the source dir.\n";
    my $dstdir = Path::Class::Dir->new(
        shift || die "Forgot to give the destination dir.\n"
    );

    # Collect every .xml file in the source directory.
    my @files = map { Path::Class::File->new($_) } glob("$srcdir/*.xml");

    for my $file ( @files ) {
        print $file, $/;
        my $doc = $parser->parse_file($file);
        my $new_file = Path::Class::File->new( $dstdir, $file->basename );
        print -e $new_file
            ? "Overwriting $new_file from $file\n"
            : "Creating $new_file from $file\n";
        # Second argument 1 asks for indented output.
        $doc->toFile($new_file, 1);
    }