Re^2: Encode double encoding?

by DrkShadow (Initiate)
on May 23, 2014 at 12:22 UTC


in reply to Re: Encode double encoding?
in thread Encode double encoding?

I suppose the question is: why can't I use decode_utf8() on output from XML::LibXML? Perhaps I don't know the correct question to ask.

More details, a request for clarity, what is going on? Data::Dumper on the output from XML::LibXML yields: "s\x{c3}\x{bc}\x{c3}\x{9f}e"; Data::Dumper on the output from decode_utf8("s\xc3\xbc\xc3\x9fe") yields: süße; decode_utf8("s\x{c3}\x{bc}\x{c3}\x{9f}e") yields: s��e. What is the difference between all of these?

What would I have to do to be able to use "s\x{c3}\x{bc}\x{c3}\x{9f}e" as though it were "s\xc3\xbc\xc3\x9fe" ?

_Is_ any of this covered in perlunitut, and if so, under what section?

Replies are listed 'Best First'.
Re^3: Encode double encoding?
by McA (Priest) on May 23, 2014 at 13:53 UTC

    Hi

    to your question:

    I suppose the question is: why can't I use decode_utf8() on output from XML::LibXML? Perhaps I don't know the correct question to ask.

    The output of XML::LibXML seems to be already decoded. When struggling with encoding/decoding, the first question to ask is: is my string an encoded byte representation of a Unicode string (I like to call this a byte string), or is it a character string where every character is represented by a Unicode code point? You can answer this question with the following code snippet:

    my $string = 'blabla';
    my $is_unicode = utf8::is_utf8($string) ? 'YES' : 'NO';
    print "my string: $string is unicode: $is_unicode\n";

    Why did I choose such a simple string 'blabla'? I want to show that there is a hidden difference which you can't see:

    #!/bin/env perl
    use strict;
    use warnings;
    use Data::Dumper;
    use Encode;

    my $string = 'blabla';
    my $is_unicode = utf8::is_utf8($string) ? 'YES' : 'NO';
    print "my string: $string is unicode: $is_unicode\n";
    print Dumper(\$string), "\n";

    my $decoded_string = decode('ASCII', $string);
    $is_unicode = utf8::is_utf8($decoded_string) ? 'YES' : 'NO';
    print "my string: $decoded_string is unicode: $is_unicode\n";
    print Dumper(\$decoded_string), "\n";

    Do you see the difference? Only the internal flag has changed. In the first case you have a byte string representing the Unicode string 'blabla' in the ASCII encoding. In the second case you have a decoded character string (decoding transfers from the byte representation to Unicode). The whole thing only looks the same because the ASCII byte sequence of 'blabla' and the Unicode code points for these characters are identical. It isn't the same!

    Now to an example with the German Eszett (ß). We have to make some assumptions: your terminal is set to UTF-8 and you store the following code snippet in UTF-8:

    #!/bin/env perl
    use strict;
    use warnings;
    use Data::Dumper;
    use Encode;

    my $string = 'müßig';
    my $is_unicode = utf8::is_utf8($string) ? 'YES' : 'NO';
    print "my string: $string is unicode: $is_unicode\n";
    print Dumper(\$string), "\n";

    my $decoded_string = decode('UTF-8', $string);
    $is_unicode = utf8::is_utf8($decoded_string) ? 'YES' : 'NO';
    print "my string: $decoded_string is unicode: $is_unicode\n";
    print Dumper(\$decoded_string), "\n";

    The output on my terminal of this script is:

    my string: müßig is unicode: NO
    $VAR1 = \'müßig';
    my string: m__ig is unicode: YES
    $VAR1 = \"m\x{fc}\x{df}ig";

    (_ is a square on the terminal)

    That's interesting, isn't it?

    In the first case we do have a byte string which is the UTF-8 representation of 'müßig'. As soon as it's output your terminal sees a valid byte sequence which is interpreted as 'ü' followed by a 'ß' and shows it to you. It seems to be the correct case. In the second case the output is garbled. So why?

    We correctly decoded the byte string 'müßig' (which is UTF-8 encoded in the Perl source code; run hexdump -C sourcefile to validate this, and you'll see the byte sequence hex c3 bc for 'ü' and hex c3 9f for 'ß'). The internal utf8 flag is set (by the way: what a misleading nomenclature IMHO, unicode-flag would have been better). Now you have a real character string 'müßig' in $decoded_string. But as soon as you print it without setting an output encoding layer, Perl assumes an output encoding of Latin-1. Therefore Perl sends the byte hex fc for 'ü' and hex df for 'ß' (correct Latin-1 encodings). Your UTF-8 terminal can't cope with these two bytes and just displays a placeholder glyph to show this decoding problem.
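
    The usual way around this is to decode once on input and encode once on output. A minimal sketch, assuming a UTF-8 terminal; the variable names are mine, and the bytes are written as escapes so the result does not depend on how the source file is saved:

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode encode);

    # UTF-8 bytes of 'müßig', spelled out as escapes.
    my $bytes = "m\xc3\xbc\xc3\x9fig";

    # Decode once on input: 5 characters, 'ü' is U+00FC and 'ß' is U+00DF.
    my $chars = decode('UTF-8', $bytes);

    # Encode once on output. Setting a layer on the handle with
    # binmode(STDOUT, ':encoding(UTF-8)') and printing $chars directly
    # should have the same effect.
    my $out = encode('UTF-8', $chars);

    # Prints "characters: 5, bytes out: 7"
    printf "characters: %d, bytes out: %d\n", length($chars), length($out);
    print $out, "\n";
    ```

    The string that reaches the terminal is byte-for-byte what we started with, but in between Perl worked with real characters, so length(), regexes and friends count 'ü' as one character instead of two bytes.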

    But have a look at the Dumper output. You see a difference here. The fact that Dumper has to dump code points is shown by the code point notation "\x{fc}" for 'ü' and "\x{df}" for 'ß'. So this kind of Dumper output is often a sign that you just dumped a character string rather than a byte string.

    Conclusion: Use utf8::is_utf8() to determine whether a string is a Unicode/character string or just a simple byte string. Be careful when concatenating character strings and simple byte strings. The resulting string will be a character string with the byte string decoded as Latin-1, not UTF-8. That may bite you. There are some functions out there which silently drop this utf8 flag when the byte representation of the byte string is the same as the character string (which is a bug IMHO), so you may lose the fact that a string is a character string (something more than a byte string).
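
    To make the concatenation trap concrete, here is a small sketch (variable names are made up for illustration). It glues the raw UTF-8 bytes of 'ü' onto a decoded 'ß' and lists the resulting code points:

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes     = "\xc3\xbc";    # UTF-8 bytes of 'ü', still undecoded
    my $eszett_in = "\xc3\x9f";    # UTF-8 bytes of 'ß'
    my $chars     = decode('UTF-8', $eszett_in);    # one character, U+00DF

    # Mixing: each byte of $bytes is taken as one Latin-1 character.
    my $mixed = $bytes . $chars;

    # Decoding before concatenating keeps 'ü' as a single character.
    my $clean = decode('UTF-8', $bytes) . $chars;

    # Helper (hypothetical name) that lists the code points of a string.
    sub codepoints { join ' ', map { sprintf 'U+%04X', ord } split //, shift }

    print 'mixed: ', codepoints($mixed), "\n";    # U+00C3 U+00BC U+00DF
    print 'clean: ', codepoints($clean), "\n";    # U+00FC U+00DF
    ```

    The mixed string ends up with 'Ã' and '¼' where the 'ü' should be, which is exactly the kind of double-encoding look that started this thread.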

    Best regards
    McA

Re^3: Encode double encoding?
by farang (Chaplain) on May 23, 2014 at 18:59 UTC

    What would I have to do to be able to use "s\x{c3}\x{bc}\x{c3}\x{9f}e" as though it were "s\xc3\xbc\xc3\x9fe" ?

    I'm not sure what else is going on, but those strings are equivalent.

    perl -MO=Deparse -e 'print "s\xc3\xbc\xc3\x9fe"'
    perl -MO=Deparse -e 'print "s\x{c3}\x{bc}\x{c3}\x{9f}e"'

    Both parse the same:

    print "s\303\274\303\237e";
    -e syntax OK

    I wonder if you have standard output encoded correctly. Maybe adding this will help.

    binmode STDOUT, ':utf8';
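
    If the value really does arrive as a character string whose code points happen to be the UTF-8 bytes (that is only my guess at what the Dumper output means; I have not checked it against XML::LibXML), one possible repair is to turn it back into bytes via Latin-1 and decode it again. A sketch, with a hand-built stand-in for the problem string:

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Stand-in for the problem value (assumed, not produced by XML::LibXML):
    # a character string whose code points are the UTF-8 bytes of 'süße'.
    my $suspect = "s\x{c3}\x{bc}\x{c3}\x{9f}e";
    utf8::upgrade($suspect);    # force the internal utf8 flag on

    # Every code point is below 0x100, so encoding as Latin-1 maps each
    # character back to the byte with the same value.
    my $bytes = encode('ISO-8859-1', $suspect);

    # Those bytes are valid UTF-8 and can now be decoded normally.
    my $fixed = decode('UTF-8', $bytes);

    binmode STDOUT, ':encoding(UTF-8)';
    print "$fixed\n";    # 4 characters: s, U+00FC, U+00DF, e
    ```

    This only papers over the symptom. If the data is double-encoded somewhere upstream, fixing it there would be the better cure.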

Re^3: Encode double encoding?
by Anonymous Monk on May 24, 2014 at 03:50 UTC
