XML::Parser - Keep Encoding?

Doron has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: XML::Parser - Keep Encoding? by ikegami (Patriarch) on Apr 21, 2009 at 15:31 UTC
I take it you're worried about speed? Encoding is very fast (i.e. a no-op) if you use `utf8::encode` or `binmode(STDOUT, ':utf8')` (when starting from a string internally encoded as UTF-8, as is the case here).
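A minimal sketch of the two options mentioned above, assuming a one-character test string. The helper name `to_utf8_bytes` is hypothetical and only exists for this illustration; `utf8::encode` and `binmode` are the real built-ins.

```perl
use strict;
use warnings;

# Hypothetical helper: return the UTF-8 bytes for a character string.
sub to_utf8_bytes {
    my ($chars) = @_;
    my $bytes = $chars;       # work on a copy, utf8::encode modifies in place
    utf8::encode($bytes);     # characters -> UTF-8 bytes
    return $bytes;
}

my $s     = "\x{F6}";         # o with diaeresis, a single character
my $bytes = to_utf8_bytes($s);

printf "%d character(s), %d byte(s) after encoding\n",
    length($s), length($bytes);

# Alternative: leave the string alone and let the handle encode on output.
binmode(STDOUT, ':utf8');
print "$s\n";
```

Only one of the two should be applied to a given string: encoding by hand and then printing through a `:utf8` handle would encode it twice.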
Re^2: XML::Parser - Keep Encoding? by Doron (Initiate) on Apr 22, 2009 at 09:47 UTC
Hello,

Thanks for your replies so far. I tried `binmode(STDOUT, ':utf8')` and it works when I place the string within the code: `binmode(STDOUT, ':utf8'); print 'ö';` This works just as expected. I also tried to open a UTF-8 encoded file using `open(my $fh, '<:encoding(utf-8)', 'test') or die $!;` and managed to print it in the encoding I wanted. However, it does not have any effect on the strings I get through XML::Parser.

As I said, I have an XML file which I parse, and based on that an HTML::Widget object is generated. The widget loads some information (to fill `<select>` fields) from a database which is already encoded in UTF-8, so the output of `$widget->as_xml()` contains both iso-8859-1 and UTF-8 encoded parts, which makes it impossible to `utf8::encode` it afterwards. Additionally, the output is generated through the Template Toolkit.

I went through the Encoding manpage, but obviously I still don't understand how encodings are handled. I was hoping for a way to tell Perl that everything should be handled in UTF-8. At first I thought the `utf8` pragma would do the trick, but I found out that it only tells Perl that the source code is written in UTF-8. Whatever `use open ':utf8';` does, it's not what I want either.

Maybe my application design makes it even more difficult to see where the mistake is. The chain is as follows:

XML document --> XML::Parser --> HTML::Widget generation --> filling data from database in the HTML::Widget --> putting the HTML::Widget in a TT template --> output through CGI::Application

I can of course encode the contents of my XML file after parsing it, but before generating the HTML::Widget. However, I do not think that this is the cleanest solution. Any more thoughts?
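The mixed output described above can be reproduced in isolation. This is only a sketch under assumptions: `$from_parser` and `$from_db` are hypothetical stand-ins for the two sources, with the parser text assumed to be a decoded character string and the database value assumed to arrive as raw UTF-8 bytes. It uses `decode` and `encode` from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical stand-ins for the two data sources in the chain.
my $from_parser = "\x{F6}";      # o with diaeresis as one character
my $from_db     = "\xC3\xB6";    # the same letter as two raw UTF-8 bytes

# Naive: characters and raw bytes are concatenated, then encoded once.
# The database bytes are treated as two characters and get encoded again.
my $mixed = encode('UTF-8', $from_parser . $from_db);

# Decode the bytes where they enter the program, encode once on the way out.
my $clean = encode('UTF-8', $from_parser . decode('UTF-8', $from_db));

printf "mixed: %d bytes, clean: %d bytes\n", length($mixed), length($clean);
# prints "mixed: 6 bytes, clean: 4 bytes"
```

The idea is to decode every input at the boundary where it enters the program, so that a single encoding step at output is enough.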
Re^3: XML::Parser - Keep Encoding? by ikegami (Patriarch) on Apr 23, 2009 at 02:11 UTC
I have no problem.

    use strict;
    use warnings;

    use HTML::Widget qw( );

    my ($up) = @ARGV
        or die;

    my $s = chr(0xC9);
    if ($up) {
        utf8::upgrade($s);    # Internally encoded as UTF-8
    } else {
        utf8::downgrade($s);  # Internally encoded as iso-latin-1
    }

    my $w = HTML::Widget->new('widget')->method('GET');
    $w->element('Textfield', 's')->label($s)->value($s);
    my $form = $w->process();

    binmode(STDOUT, ':utf8');
    print(<<"__EOI__");
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <title>Test</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <p>$s</p>
    $form
    __EOI__

Running it both ways produces identical output:

    >perl 759467.pl 0 > 0.html
    >perl 759467.pl 1 > 1.html
    >fc /b 0.html 1.html
    Comparing files 0.html and 1.HTML
    FC: no differences encountered

HTML::Widget also worked fine with characters iso-latin-1 can't represent.

Please provide a minimal program to demonstrate your problem.
Re^4: XML::Parser - Keep Encoding? by Doron (Initiate) on Apr 27, 2009 at 17:52 UTC
Re: XML::Parser - Keep Encoding? by mirod (Canon) on Apr 21, 2009 at 15:41 UTC
Actually, if you are using a modern (5.8) perl, the internal format is utf8 (or close enough, IIRC it is actually a superset of utf8). On output though, unless you specify that you want utf8, perl will convert the data back to ISO-8859-1, if possible. Telling perl to output the data in utf8 is done, as ikegami mentioned, using the `binmode` function.
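The effect of the output layer can be observed without a terminal by printing to an in-memory filehandle. This is a sketch for illustration; `written_bytes` is a hypothetical helper, and passing `'>:utf8'` as the open mode stands in for calling `binmode` with `':utf8'` on a real handle.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle opened with
# $mode and return the bytes that were actually written.
sub written_bytes {
    my ($mode, $text) = @_;
    open(my $fh, $mode, \my $buf) or die $!;
    print {$fh} $text;
    close($fh) or die $!;
    return $buf;
}

# Format a byte string as space-separated hex pairs.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

my $s = "\x{E9}";     # e with acute accent
utf8::upgrade($s);    # force the internal UTF-8 representation

printf "no layer:    %s\n", hex_dump(written_bytes('>',      $s));
printf ":utf8 layer: %s\n", hex_dump(written_bytes('>:utf8', $s));
```

Without a layer the character is written as the single byte E9, regardless of how it is stored internally; with the `:utf8` layer it is written as C3 A9.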
Re^2: XML::Parser - Keep Encoding? by ikegami (Patriarch) on Apr 21, 2009 at 15:50 UTC
mirod wrote: "Actually, if you are using a modern (5.8) perl, the internal format is utf8 (or close enough, IIRC it is actually a superset of utf8)."

Almost. The standard character set is UTF-8 (case-insensitive, with a dash). The internal format is locally known (non-standard, Perl-only) as utf8. It's a superset of UTF-8 capable of representing all 32-bit or 64-bit numbers (depending on the system, for some definition of system).
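The Encode module exposes the same distinction through its encoding names. The following sketch assumes a code point just past the Unicode range (0x110000) as the test value: the lax `utf8` encoding accepts it, while the strict `UTF-8` encoding is expected to refuse it when asked to croak on errors.

```perl
use strict;
use warnings;
no warnings 'utf8';    # the example deliberately uses a non-Unicode code point
use Encode qw( encode find_encoding );

my $beyond = chr(0x110000);    # one past the last Unicode code point

# Lax, Perl-only "utf8": any code point Perl can hold is encoded.
my $lax = encode('utf8', $beyond);

# Strict "UTF-8": ask Encode to die instead of substituting.
my $copy      = $beyond;       # encode may modify its input when a check is set
my $strict_ok = eval { encode('UTF-8', $copy, Encode::FB_CROAK); 1 };

printf "utf8  is known to Encode as %s\n", find_encoding('utf8')->name;
printf "UTF-8 is known to Encode as %s\n", find_encoding('UTF-8')->name;
printf "lax encoding produced %d bytes\n", length($lax);
printf "strict encoding %s\n", $strict_ok ? 'succeeded' : 'refused';
```

For data that leaves the program, the strict `UTF-8` name is usually the safer choice, since other software may reject sequences that only Perl's lax format allows.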