artist has asked for the wisdom of the Perl Monks concerning the following question:

I have a process in place to convert files.
  1. convert CSV to XML using XML::Writer
  2. convert XML to HTML using XML::XSLT.
Now a file containing the character 'ÿ' is causing a problem. Emacs reports it as Char: ÿ (04377, 2303, 0x8ff, file FF).
I have the headers set for UTF-8 in the XML file. Converting the CSV to XML does not pose any problem, but converting the XML to HTML gives the message "Error while parsing: not well-formed at line 13, column 54, byte 498", which points at this particular character.

Are there problems with XML::Writer or am I missing something else?

Thanks,
artist

Re: XML and special characters
by tilly (Archbishop) on Feb 21, 2004 at 02:51 UTC
    The problem is that 0xFF is in the extended (8-bit) character set and is being passed straight through to the file. Without knowing your encoding, you don't know what character it is supposed to be. This is the problem that Unicode is supposed to solve, but it does so with a careful encoding mechanism: a raw 0xFF byte can never appear in well-formed UTF-8, so a file that declares itself UTF-8 and contains that byte is invalid, and you get an error.
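
    To make that concrete, here is a minimal sketch using the core Encode module. It assumes the CSV is really ISO-8859-1 (Latin-1), which the original post does not confirm; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The raw byte 0xFF, as it would be passed through from the CSV.
my $raw = "\xFF";

# Strict UTF-8 decoding of that byte fails. Decode a copy, because
# decode() may modify its argument when a CHECK value is given.
my $copy = $raw;
my $ok   = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
print "valid UTF-8? ", ($ok ? "yes" : "no"), "\n";

# ASSUMPTION: the input is ISO-8859-1, where 0xFF is U+00FF.
my $char = decode('ISO-8859-1', $raw);

# Properly encoded, that one character becomes two bytes in UTF-8.
my $utf8 = encode('UTF-8', $char);
print join(' ', map { sprintf '%02X', ord($_) } split //, $utf8), "\n";
```

    If the input turns out to be in some other 8-bit encoding, the same byte would stand for a different character, which is why the encoding has to be known first.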

    I'm not sure what the best API is for solving it, but you can solve it as follows if you have a recent enough Perl (I know that Perl 5.8 works, I dunno how bad Perl 5.6 is in this respect). If your XML file is called, say, "output.xml", then open it with IO::File->new("output.xml", ">:utf8") and leave the rest of your code alone. (I may have that open command wrong; if I did, then stare at the documentation and play with it until you figure out how to convince Perl to automatically output correct Unicode.)
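
    A minimal sketch of that idea, using only core modules. IO::File->new takes the file name first and the mode second, and a mode containing a ":" is handed to three-argument open, so the layer can go in the mode string. The temporary directory and the sample text are illustrative assumptions, not part of the original process.

```perl
use strict;
use warnings;
use IO::File;
use File::Temp qw(tempdir);

# Write into a scratch directory so nothing real is overwritten.
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/output.xml";

# Sample text holding the 8-bit characters 0xEF and 0xFF.
my $text = "na\xEFve \xFF";

# The :utf8 layer makes Perl encode every character on output.
my $out = IO::File->new($file, '>:utf8')
    or die "Cannot open $file: $!";
$out->print(qq{<?xml version="1.0" encoding="UTF-8"?>\n<v>$text</v>\n});
$out->close or die "Cannot close $file: $!";

# Read the file back as raw bytes to see what was actually written.
open my $in, '<:raw', $file or die "Cannot read $file: $!";
my $bytes = do { local $/; <$in> };
close $in;

print "0xFF written as C3 BF\n" if $bytes =~ /\xC3\xBF/;
```

    In the real script the handle would presumably be given to XML::Writer through its OUTPUT option instead of being printed to directly.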

    I won't guarantee, however, that the author of XML::Writer won't some day decide to solve the Unicode problem on his end, leading to a double-encoding and garbage output. Contacting him may therefore be worthwhile.
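
    For illustration, this is what such a double encoding would look like at the byte level. It is only a sketch with the core Encode module; it does not reflect anything XML::Writer currently does.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Helper (illustrative): render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, shift }

my $char = "\xFF";                    # one character, U+00FF

# Encoding once gives the correct two-byte UTF-8 sequence.
my $once = encode('UTF-8', $char);

# Encoding again treats each of those bytes as a character of its
# own, so the single character swells to four bytes of garbage.
my $twice = encode('UTF-8', $once);

print "once:  ", hex_bytes($once),  "\n";
print "twice: ", hex_bytes($twice), "\n";
```

    A UTF-8 reader would show the doubly encoded bytes as two characters (U+00C3 followed by U+00BF) instead of the single intended one.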