Re^2: Encode double encoding?

Re^3: Encode double encoding? by McA (Priest) on May 23, 2014 at 13:53 UTC
Hi, to your question:

"I suppose the question is: why can't I use decode_utf8() on output from XML::LibXML? Perhaps I don't know the correct question to ask."

The output of XML::LibXML seems to be already decoded. When struggling with encoding/decoding, the first question to ask is: is my string an encoded byte representation of a Unicode string (I like to call this a byte string), or is it a character string where every character is represented by a Unicode code point? You can answer this question with the following code snippet: `my $string = 'blabla'; my $is_unicode = utf8::is_utf8($string) ? 'YES' : 'NO'; print "my string: $string is unicode: $is_unicode\n";`

Why did I choose such a simple string as 'blabla'? I want to show that there is a hidden difference which you can't see: `#!/usr/bin/env perl use strict; use warnings; use Data::Dumper; use Encode; my $string = 'blabla'; my $is_unicode = utf8::is_utf8($string) ? 'YES' : 'NO'; print "my string: $string is unicode: $is_unicode\n"; print Dumper(\$string), "\n"; my $decoded_string = decode('ASCII', $string); $is_unicode = utf8::is_utf8($decoded_string) ? 'YES' : 'NO'; print "my string: $decoded_string is unicode: $is_unicode\n"; print Dumper(\$decoded_string), "\n";`

Do you see the difference? Only the internal flag has changed. In the first case you have a byte string representing the Unicode string 'blabla' in the ASCII encoding. In the second case you have a decoded character string (transformed from the byte representation to Unicode). Only the fact that the ASCII byte sequence of 'blabla' and the Unicode code points of these characters are the same makes the two look identical. They aren't!

Now to an example with the German eszett (ß). We have to make some assumptions: your terminal is set to UTF-8 and you store the following code snippet in UTF-8: `#!/usr/bin/env perl use strict; use warnings; use Data::Dumper; use Encode; my $string = 'müßig'; my $is_unicode = utf8::is_utf8($string) ? 'YES' : 'NO'; print "my string: $string is unicode: $is_unicode\n"; print Dumper(\$string), "\n"; my $decoded_string = decode('UTF-8', $string); $is_unicode = utf8::is_utf8($decoded_string) ? 'YES' : 'NO'; print "my string: $decoded_string is unicode: $is_unicode\n"; print Dumper(\$decoded_string), "\n";`

The output of this script on my terminal is: `my string: müßig is unicode: NO $VAR1 = \'müßig'; my string: m__ig is unicode: YES $VAR1 = \"m\x{fc}\x{df}ig";` (_ is a square on the terminal)

That's interesting, isn't it? In the first case we have a byte string which is the UTF-8 representation of 'müßig'. As soon as it's output, your terminal sees a valid byte sequence, interprets it as 'ü' followed by 'ß', and shows it to you. It looks like the correct case.

In the second case the output is garbled. So why? We correctly decoded the byte string 'müßig' (which is UTF-8 encoded in the Perl source code; run `hexdump -C sourcefile` to verify this, and you'll see the byte sequence hex c3 bc for 'ü' and hex c3 9f for 'ß'). The internal utf8 flag is set (by the way: what a misleading nomenclature IMHO, unicode flag would have been better). Now you have a real character string 'müßig' in `$decoded_string`. But as soon as you print it without setting an output encoding layer, Perl assumes an output encoding of Latin-1. Therefore Perl sends the byte hex fc for 'ü' and hex df for 'ß' (the correct Latin-1 encodings). Your terminal, expecting UTF-8, can't cope with these two bytes and just displays a placeholder glyph to show the decoding problem.

But have a look at the Dumper output. You see a difference here. The fact that Dumper has to dump code points is shown by the code point representation "\x{fc}" for 'ü' and "\x{df}" for 'ß'. So this kind of Dumper output is often a sign that you just dumped a character string rather than a byte string.

Conclusion: Use utf8::is_utf8() to determine whether a string is a Unicode/character string or just a simple byte string. Be careful when concatenating character strings and simple byte strings. The resulting string will be a character string with the byte string decoded as Latin-1, not UTF-8. That may bite you. There are some functions out there which silently drop the utf8 flag when the byte representation of the byte string is the same as that of the character string (which is a bug IMHO), so you may lose the fact that a string is a character string (something more than a byte string).

Best regards, McA
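To make the concatenation warning concrete, here is a minimal sketch. The variable names are only illustrative, and the strings are written with escapes so the result does not depend on how the source file is saved. It joins a character string with a UTF-8 byte string, first as-is and then after decoding the bytes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Byte string: the two octets that encode 'ü' in UTF-8.
my $bytes = "\xc3\xbc";

# Character string: U+00FC ('ü') plus U+263A, a code point above 255,
# so this can only be a character string.
my $chars = "\x{fc}\x{263a}";

# Mixing them: each octet of $bytes is treated as one Latin-1 character.
my $mixed = $chars . $bytes;

# Decoding the bytes first gives the intended single 'ü'.
my $fixed = $chars . decode('UTF-8', $bytes);

for my $pair ([mixed => $mixed], [fixed => $fixed]) {
    my ($label, $string) = @$pair;
    my $points = join ' ', map { sprintf('U+%04X', ord($_)) } split //, $string;
    printf "%s: length %d, code points %s\n", $label, length($string), $points;
}
```

The mixed string ends in U+00C3 U+00BC (two characters) where the fixed one ends in a single U+00FC, which is exactly the "decoded as Latin-1 and not UTF-8" effect described above.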
Re^3: Encode double encoding? by farang (Chaplain) on May 23, 2014 at 18:59 UTC
"What would I have to do to be able to use `"s\x{c3}\x{bc}\x{c3}\x{9f}e"` as though it were `"s\xc3\xbc\xc3\x9fe"`?"

I'm not sure what else is going on, but those strings are equivalent. Compare `perl -MO=Deparse -e 'print "s\xc3\xbc\xc3\x9fe"'` with `perl -MO=Deparse -e 'print "s\x{c3}\x{bc}\x{c3}\x{9f}e"'`. Both parse the same: `print "s\303\274\303\237e"; -e syntax OK`

I wonder whether your standard output is encoded correctly. Maybe adding this will help: `binmode STDOUT, ':utf8';`
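A small sketch of the same point in code, with one caveat that is my own addition rather than something stated above: an encoding layer on STDOUT only helps once the string has been decoded into characters. If the raw octets are pushed through an encoder, they get encoded a second time. The variable names are illustrative, and the layer is spelled `:encoding(UTF-8)` here instead of `:utf8`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode encode);

my $plain  = "s\xc3\xbc\xc3\x9fe";              # escapes without braces
my $braced = "s\x{c3}\x{bc}\x{c3}\x{9f}e";      # escapes with braces

print "literals are equal\n" if $plain eq $braced;
printf "as octets: length %d\n", length($plain);

# Decode the UTF-8 octets into the four characters s, ü, ß, e.
my $chars = decode('UTF-8', $plain);
printf "as characters: length %d\n", length($chars);

# Encoding the characters gives the original octets back.
my $once = encode('UTF-8', $chars);
# Encoding the undecoded octets treats each one as a Latin-1 character:
# this is the double encoding.
my $twice = encode('UTF-8', $plain);
printf "encoded once: %d octets, encoded twice: %d octets\n",
    length($once), length($twice);

# With an output layer in place, print the character string.
binmode STDOUT, ':encoding(UTF-8)';
print "$chars\n";
```

On a UTF-8 terminal the last line should show the word with its umlaut and eszett intact. Printing `$plain` through the same layer would instead show the double-encoded form.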
Re^3: Encode double encoding? by Anonymous Monk on May 24, 2014 at 03:50 UTC
"I suppose the question is: why can't I use decode_utf8() on output from XML::LibXML?"

Because https://metacpan.org/pod/XML::LibXML#ENCODINGS-SUPPORT-IN-XML::LIBXML already gives you "characters" instead of bytes/octets/raw ... yields ... See the replies by McA and farang, thanks monks :)

Also, terminals lie, even browsers lie, the bytes do not lie ... see the comments in this code (and the code, and keep in mind the 5min tutorial).

"_Is_ any of this covered in perlunitut, and if so, under what section?"

You can start with I/O flow (the actual 5 minute tutorial). You should read the whole thing and the links it links to. Also download the tarball from Perl Unicode Essentials: OSCON 2011 - O'Reilly Conferences, July 25 - 29, 2011, Portland, OR for even more Unicode info.
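One way to look at the bytes instead of trusting the terminal is to print them as hex. This is only a sketch: `octets_hex` is a made-up helper name, not part of Encode or XML::LibXML, and the character string is built by hand instead of being read from a parsed document.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render every octet of a byte string as two hex digits.
sub octets_hex {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $octets;
}

# A character string, as a parser that returns characters would hand it over.
my $chars = "m\x{fc}\x{df}ig";

# Encode explicitly at the output boundary and inspect the result.
my $utf8   = octets_hex(encode('UTF-8', $chars));
my $latin1 = octets_hex(encode('ISO-8859-1', $chars));

print "UTF-8:   $utf8\n";      # 6d c3 bc c3 9f 69 67
print "Latin-1: $latin1\n";    # 6d fc df 69 67
```

This mirrors the I/O flow from perlunitut: decode on input, work with characters, encode on output. The hex view shows which encoding actually left the program, whatever the terminal makes of it.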

