Doron has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I'm using XML::Parser (Tree style) to parse a document written in UTF-8. It seems to translate all text nodes to Perl's internal encoding. Is there a way to keep it in UTF-8? The document is used to generate a HTML::Widget for a UTF-8 based website and I do not want to re-encode the output of my HTML::Widget to UTF-8. Thanks!

Re: XML::Parser - Keep Encoding?
by ikegami (Patriarch) on Apr 21, 2009 at 15:31 UTC
    I take it you're worried about speed? Encoding is very fast (i.e. a no-op) if you use utf8::encode or binmode(STDOUT, ':utf8') (when starting from a string internally encoded as UTF-8 as is the case here).
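To make the point above concrete, here is a minimal sketch of what utf8::encode does to a string that is already stored internally as UTF-8. The sample character (U+00F6) and the variable names are assumptions chosen for illustration; they are not taken from the original poster's program.

```perl
use strict;
use warnings;

# A character string holding U+00F6, standing in for a text node
# returned by a parser (hypothetical sample data).
my $text = "\x{F6}";
utf8::upgrade($text);    # force the internal UTF-8 representation

# utf8::encode works in place: it turns characters into UTF-8 bytes.
my $bytes = $text;
utf8::encode($bytes);

printf "characters: %d\n", length($text);
printf "bytes:      %d\n", length($bytes);
printf "hex:        %s\n", unpack('H*', $bytes);
```

The character string has length 1, while the encoded copy has length 2 and holds the bytes c3 b6. Because the upgraded string already stores those two bytes internally, the encode step has very little work to do.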

      Hello,

      Thanks for your replies so far.

      I tried binmode(STDOUT, ':utf8') and it works when the string is placed directly in the code:

      binmode(STDOUT, ':utf8');
      print 'ö';

      This works just as expected. I tried to open a UTF-8 encoded file using open(my $fh, '<:encoding(utf-8)', 'test') or die $!; and managed to print it in the encoding I wanted.

      However, it has no effect on the strings I get from XML::Parser. As I said, I parse an XML file and generate an HTML::Widget object from it. The widget loads some information (to fill <select> fields) from a database that is already encoded in UTF-8, so the output of $widget->as_xml() contains both iso-8859-1 and UTF-8 encoded parts, which makes it impossible to utf8::encode it afterwards. Additionally, the output is generated through the Template Toolkit.
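One plausible reading of the symptom described above is that decoded text from the parser is being concatenated with database values that arrive as raw, undecoded UTF-8 bytes. The sketch below illustrates that situation under this assumption; $from_xml and $from_db are hypothetical stand-ins, and no real database or HTML::Widget call is involved.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

my $from_xml = "\x{F6}";      # decoded text: one character (U+00F6)
my $from_db  = "\xC3\xB6";    # hypothetical: the same letter as raw UTF-8 bytes

# Mixing decoded text with undecoded bytes, then encoding once at output,
# encodes the database bytes a second time.
my $mixed = encode('UTF-8', $from_xml . $from_db);

# Decoding the database value at the input boundary keeps everything
# as characters, so the single encode at output is correct.
my $clean = encode('UTF-8', $from_xml . decode('UTF-8', $from_db));

print unpack('H*', $mixed), "\n";
print unpack('H*', $clean), "\n";
```

The first line printed shows the double-encoded result (c3b6c383c2b6), the second the intended one (c3b6c3b6). If this is what is happening, decoding the database values on the way in may be a cleaner fix than encoding the parser output.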

      I went through the Encoding manpage, but obviously I still can't understand how encodings are handled. I was hoping for a way to tell Perl that everything should be handled in UTF-8. First I thought the utf8 pragma would do the trick, but I found out that it only tells Perl that the code is written in UTF-8. Whatever use open ':utf8'; does, it's not what I want either.

      Maybe my application design makes it even more difficult to understand where to find the mistake: The chain is as follows: XML document --> XML::Parser --> HTML::Widget generation --> filling data from database in the HTML::Widget --> putting the HTML::Widget in a TT template --> output through CGI::Application

      I can of course encode the contents from my XML file after parsing it, but before generating the HTML::Widget. However, I do not think that this is the cleanest solution.

      Any more thoughts?

        I have no problem.
        use strict;
        use warnings;

        use HTML::Widget qw( );

        my ($up) = @ARGV or die;

        my $s = chr(0xC9);
        if ($up) {
            utf8::upgrade($s);    # Internally encoded as UTF-8
        } else {
            utf8::downgrade($s);  # Internally encoded as iso-latin-1
        }

        my $w = HTML::Widget->new('widget')->method('GET');
        $w->element('Textfield', 's')->label($s)->value($s);
        my $form = $w->process();

        binmode(STDOUT, ':utf8');
        print(<<"__EOI__");
        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
        <title>Test</title>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
        <p>$s</p>
        $form
        __EOI__

        >perl 759467.pl 0 > 0.html

        >perl 759467.pl 1 > 1.html

        >fc /b 0.html 1.html
        Comparing files 0.html and 1.HTML
        FC: no differences encountered

        HTML::Widget also worked fine with characters iso-latin-1 can't represent.

        Please provide a minimal program to demonstrate your problem.

Re: XML::Parser - Keep Encoding?
by mirod (Canon) on Apr 21, 2009 at 15:41 UTC

    Actually, if you are using a modern (5.8) perl, the internal format is utf8 (or close enough, IIRC it is actually a superset of utf8).

    On output though, unless you specify that you want utf8, perl will convert the data back to ISO-8859-1, if possible. Telling perl to output the data in utf8 is done, as ikegami mentioned, using the binmode function.
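The effect of the output layer can be observed without touching STDOUT by printing to in-memory filehandles. This is only a sketch: the buffer names are made up for illustration, and it assumes no default layers have been set through the open pragma or the PERL_UNICODE environment variable.

```perl
use strict;
use warnings;

my $s = chr(0xE9);    # U+00E9, representable in ISO-8859-1

my ($plain, $layered) = ('', '');

# No layer: the character is written as a single Latin-1 byte.
open(my $fh_plain, '>', \$plain) or die $!;
print {$fh_plain} $s;
close($fh_plain);

# With the :utf8 layer: the same character is written as two UTF-8 bytes.
open(my $fh_layered, '>:utf8', \$layered) or die $!;
print {$fh_layered} $s;
close($fh_layered);

print unpack('H*', $plain),   "\n";
print unpack('H*', $layered), "\n";
```

Under those assumptions the first buffer holds e9 and the second holds c3a9, which matches the behaviour described above for binmode on a real filehandle.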

      Actually, if you are using a modern (5.8) perl, the internal format is utf8 (or close enough, IIRC it is actually a superset of utf8).

      Almost. The standard character set is UTF-8 (case-insensitive, with a dash).

      The internal format is locally known (non-standard, Perl-only) as utf8. It's a superset of UTF-8 capable of representing all 32-bit or 64-bit numbers (depending on the system, for some definition of system).
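The same naming split is visible in the Encode module, which treats "UTF-8" and "utf8" as two different encodings. The short sketch below only asks Encode for the canonical names; it is offered as an illustration of the distinction, not as a description of how perl stores strings internally.

```perl
use strict;
use warnings;
use Encode qw( find_encoding );

# "UTF-8" (with a dash) resolves to the strict, standard encoding.
my $strict = find_encoding('UTF-8')->name;

# "utf8" (no dash) resolves to Perl's lax, extended variant.
my $lax = find_encoding('utf8')->name;

print "UTF-8 -> $strict\n";
print "utf8  -> $lax\n";
```

This prints utf-8-strict for the first name and utf8 for the second, so the spelling passed to encode or decode selects which set of rules is applied.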