DrkShadow has asked for the wisdom of the Perl Monks concerning the following question:

I've asked elsewhere and was told that this data is Unicode encoded twice, but I don't see that. I've tried decoding from the current format to the desired format, and I get the same data back. I try converting from the desired encoding to the original encoding, and... I get the expected data?! But then it doesn't print. I'm just confused -- could someone explain what is happening, and why?

#!/usr/bin/env perl
use warnings;
use strict;
use XML::LibXML;
use Encode qw(encode_utf8 decode_utf8 decode from_to _utf8_off is_utf8);

my $xml = XML::LibXML->new();
$xml->load_ext_dtd(1);
$xml = $xml->load_xml(load_ext_dtd => 0, string =>
    "<?xml version='1.0' encoding=\"ISO-8859-1\" standalone=\"yes\"?><tag string=\"s&#195;&#188;&#195;&#159;e\" />");

# The literal hex from the document. In talking with others,
# this is UTF-8 data.
my $desired = "s\xc3\xbc\xc3\x9fe";
my $desired_utf8 = decode_utf8($desired);
printf "Desired: %s;\t%v02x\n", $desired_utf8, $desired_utf8;
# Prints: Desired: s��e; 73.fc.df.65. The
# hex is right, but the string is wrong. ?!

my $temp;
my $tag = $xml->getElementsByTagName('tag');
my $string = ${$tag}[0]->attributes->getNamedItem('string')->nodeValue;
printf "obtained: %s;\t%v02x\n", $string, $string;
printf "\tdec: %s, %v02x\n", decode_utf8($string), decode_utf8($string);
# Prints:
#   obtained: süße; 73.c3.bc.c3.9f.65
#   dec: süße; 73.c3.bc.c3.9f.65
# The string is right, but the hex is wrong. ?! Further,
# why would these two lines be the same, given
# decode_utf8()?!

$temp = $string;
print "\$temp " . (is_utf8($temp) ? 'is' : "isn't") . " tagged as a unicode string.\n";
_utf8_off($temp);
print "\$temp " . (is_utf8($temp) ? 'is' : "isn't") . " tagged as a unicode string.\n";
printf "\tdec: %s, %v02x\n", decode_utf8($temp), decode_utf8($temp);
# Prints:
#   $temp is tagged as a unicode string.
#   $temp isn't tagged as a unicode string.
#   dec: süße, 73.c3.bc.c3.9f.65
# Why, again, would this line not have assumed octets of
# a UTF8 string?!

$temp = $string;
from_to($temp, 'iso-8859-1', 'utf8');
printf "\tfto: %s, %v02x\n", decode_utf8($temp), decode_utf8($temp);
# An attempt to convert from what the document is supposed
# to be to what I've been informed the string should be.
# Prints:
#   fto: süße, 73.c3.bc.c3.9f.65
# Again, why is no decoding happening?!

$temp = $string;
from_to($temp, 'latin1', 'utf8');
printf "\tfto: %s, %v02x\n", decode_utf8($temp), decode_utf8($temp);
# An attempt to force $temp into being assumed _just_
# octets, so that any unicode flags don't affect the
# output.. just a guess.
# Prints:
#   fto: süße, 73.c3.bc.c3.9f.65
# Again, how could this be the same?

$temp = $string;
from_to($temp, 'utf8', 'ISO-8859-1');
printf "\tfto: %s, %v02x\n", $temp, $temp;
# Prints:
#   fto: s��e, 73.fc.df.65
# The string is _incorrect_ and the decoded hex is
# _correct_! Why??

# The actual use. Select an id from a table in a database
# based on the UTF8 string.
#$getcount = $groupdb->prepare('SELECT id FROM table WHERE UTF8_string=? AND timestamp=?');
#$getcount->execute(decode_utf8($string), $datetime);
#$getcount->bind_columns(\$result);
#$getcount->fetch();
# False

#$getcount->execute(decode('iso-8859-1', $string), $datetime);
#$getcount->bind_columns(\$result);
#$getcount->fetch();
# False

#$getcount->execute(decode_utf8($desired), $datetime);
#$getcount->bind_columns(\$result);
#$getcount->fetch();
# True -- this works.

Replies are listed 'Best First'.
Re: Encode double encoding?
by Anonymous Monk on May 23, 2014 at 11:11 UTC

    ... could someone explain what is happening, and why?

    It's easier to start from the beginning :) You've got too much code in the question. Trim it down to the absolute minimum, use Data::Dump::dd for generating sample input and for verifying the encoding of data, deal with just three strings (input, wanted output, actual output), and explain what you don't like about the actual output, something like the sketch below.
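    For instance, a minimal sketch of that kind of three-string reduction (assuming Data::Dump is installed; the strings are just the ones from your post):

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Data::Dump qw(dd);

    my $input  = "s\xC3\xBC\xC3\x9Fe";            # the raw bytes from the document
    my $wanted = "s\x{fc}\x{df}e";                # "süße" as a character string
    my $actual = "s\x{c3}\x{bc}\x{c3}\x{9f}e";    # what you actually end up with

    dd($input);     # dd makes byte strings vs. character strings visible
    dd($wanted);
    dd($actual);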

    I don't see any problems here, or any need to encode/decode anything myself (see perlunitut: Unicode in Perl), since libxml knows about encodings:

    #!/usr/bin/perl --
    use warnings;
    use strict;
    use XML::LibXML;
    use Data::Dump qw/ dd /;

    my $orig = "<?xml version='1.0' encoding=\"ISO-8859-1\" standalone=\"yes\"?><tag string=\"s&#195;&#188;&#195;&#159;e\" />";
    my $dom = XML::LibXML->new(qw/ recover 2 /)->load_xml( string => $orig );

    dd( $orig );
    dd( "$dom" );
    dd( $dom->findvalue( q{//*/@string } ) );
    printf "%v02x\n", $dom->findvalue( q{//*/@string } );
    dd( map { ord $_ } split //, $dom->findvalue( q{//*/@string } ) );

    __END__
    "<?xml version='1.0' encoding=\"ISO-8859-1\" standalone=\"yes\"?><tag string=\"s&#195;&#188;&#195;&#159;e\" />"
    "<?xml version=\"1.0\" encoding=\"ISO-8859-1\" standalone=\"yes\"?>\n<tag string=\"s\xC3\xBC\xC3\x9Fe\"/>\n"
    "s\xC3\xBC\xC3\x9Fe"
    73.c3.bc.c3.9f.65
    (115, 195, 188, 195, 159, 101)

      I suppose the question is: why can't I use decode_utf8() on output from XML::LibXML? Perhaps I don't know the correct question to ask.

      More details, and a request for clarity: what is going on? Data::Dumper on the output from XML::LibXML yields "s\x{c3}\x{bc}\x{c3}\x{9f}e"; Data::Dumper on the output of decode_utf8("s\xc3\xbc\xc3\x9fe") yields süße; decode_utf8("s\x{c3}\x{bc}\x{c3}\x{9f}e") yields s��e. What is the difference between all of these?

      What would I have to do to be able to use "s\x{c3}\x{bc}\x{c3}\x{9f}e" as though it were "s\xc3\xbc\xc3\x9fe" ?

      _Is_ any of this covered in perlunitut, and if so, under what section?

        Hi

        to your question:

        I suppose the question is: why can't I use decode_utf8() on output from XML::LibXML? Perhaps I don't know the correct question to ask.

        The output of XML::LibXML seems to be already decoded. When struggling with encoding/decoding, the first question to ask is: is my string an encoded byte representation of a Unicode string (I like to call this a byte string), or is it a character string where every character is represented by a Unicode codepoint? You can answer this question with the following code snippet:

        my $string = 'blabla';
        my $is_unicode = utf8::is_utf8($string) ? 'YES' : 'NO';
        print "my string: $string is unicode: $is_unicode\n";

        Why did I choose such a simple string 'blabla'? I want to show that there is a hidden difference which you can't see:

        #!/bin/env perl
        use strict;
        use warnings;
        use Data::Dumper;
        use Encode;

        my $string = 'blabla';
        my $is_unicode = utf8::is_utf8($string) ? 'YES' : 'NO';
        print "my string: $string is unicode: $is_unicode\n";
        print Dumper(\$string), "\n";

        my $decoded_string = decode('ASCII', $string);
        $is_unicode = utf8::is_utf8($decoded_string) ? 'YES' : 'NO';
        print "my string: $decoded_string is unicode: $is_unicode\n";
        print Dumper(\$decoded_string), "\n";

        Do you see the difference? Only the internal flag has changed. In the first case you have a byte string representing the Unicode string 'blabla' in the ASCII encoding. In the second case you have a decoded (transferred from a representation to Unicode) character string. Only the fact that the ASCII byte sequence of 'blabla' and the codepoints of these characters in Unicode are the same makes the whole thing look identical. It isn't!
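        If you want to see that internal flag (and the internal representation) directly, the core module Devel::Peek can show it; a minimal sketch:

        use strict;
        use warnings;
        use Devel::Peek;
        use Encode;

        my $bytes = 'blabla';
        my $chars = decode('ASCII', $bytes);

        Dump($bytes);   # FLAGS line has no UTF8 flag: a plain byte string
        Dump($chars);   # FLAGS line contains UTF8: a character string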

        Now to an example with the German eszett. We have to make some assumptions: your terminal is set to UTF-8 and you store the following code snippet in UTF-8:

        #!/bin/env perl
        use strict;
        use warnings;
        use Data::Dumper;
        use Encode;

        my $string = 'müßig';
        my $is_unicode = utf8::is_utf8($string) ? 'YES' : 'NO';
        print "my string: $string is unicode: $is_unicode\n";
        print Dumper(\$string), "\n";

        my $decoded_string = decode('UTF-8', $string);
        $is_unicode = utf8::is_utf8($decoded_string) ? 'YES' : 'NO';
        print "my string: $decoded_string is unicode: $is_unicode\n";
        print Dumper(\$decoded_string), "\n";

        The output on my terminal of this script is:

        my string: müßig is unicode: NO
        $VAR1 = \'müßig';

        my string: m__ig is unicode: YES
        $VAR1 = \"m\x{fc}\x{df}ig";

        (_ is a square on the terminal)

        That's interesting, isn't it?

        In the first case we have a byte string which is the UTF-8 representation of 'müßig'. As soon as it's output, your terminal sees a valid UTF-8 byte sequence, interprets it as 'ü' followed by 'ß', and shows it to you. That looks like the correct case. In the second case the output is garbled. Why?

        We correctly decoded the byte string 'müßig' (which is UTF-8 encoded in the Perl source code; run hexdump -C on the source file to validate this: you'll see the byte sequence hex c3 bc for 'ü' and hex c3 9f for 'ß'). The internal utf8 flag is set (by the way: what a misleading nomenclature IMHO, unicode flag would have been better). Now you have a real character string 'müßig' in $decoded_string. But as soon as you print it without changing the output encoding layers, Perl assumes an output encoding of Latin-1. Therefore Perl sends the byte hex fc for 'ü' and hex df for 'ß' (the correct Latin-1 encodings). Your terminal can't cope with these two bytes and just displays a placeholder glyph to signal this decoding problem.
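        The usual way out is to tell Perl what encoding STDOUT expects by pushing an I/O layer; a minimal sketch, assuming a UTF-8 terminal:

        use strict;
        use warnings;
        use Encode;

        # Encode everything printed to STDOUT as UTF-8.
        binmode STDOUT, ':encoding(UTF-8)';

        my $decoded_string = decode('UTF-8', "m\xc3\xbc\xc3\x9fig");   # character string 'müßig'
        print "$decoded_string\n";   # now prints müßig instead of placeholder glyphs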

        But have a look at the Dumper output. You see a difference here. The fact that Dumper has to dump a codepoint is shown with the codepoint notation "\x{fc}" for 'ü' and "\x{df}" for 'ß'. So this kind of Dumper output is often a sign that you just dumped a character string, in contrast to a byte string.

        Conclusion: Use utf8::is_utf8() to determine whether a string is a Unicode/character string or just a simple byte string. Be careful when concatenating character strings and simple byte strings: the resulting string will be a character string with the byte string decoded as Latin-1, not UTF-8. That may bite you (see the sketch below). There are also functions out there which silently drop the utf8 flag when the byte representation of the string happens to match the character string (which is a bug IMHO), so you may lose the fact that a string is a character string (something more than a byte string).
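        A minimal sketch of that concatenation pitfall, assuming a UTF-8 terminal:

        use strict;
        use warnings;
        use Encode;

        binmode STDOUT, ':encoding(UTF-8)';

        my $bytes = "s\xc3\xbc\xc3\x9fe";       # UTF-8 bytes for 'süße', UTF8 flag off
        my $chars = decode('UTF-8', $bytes);    # character string 'süße', UTF8 flag on

        # Concatenation upgrades the byte string as if it were Latin-1:
        my $mixed = $chars . ' / ' . $bytes;
        print "$mixed\n";   # the character half prints as 'süße', the byte half comes out as mojibake (Ã¼ ...)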

        Best regards
        McA

        What would I have to do to be able to use "s\x{c3}\x{bc}\x{c3}\x{9f}e" as though it were "s\xc3\xbc\xc3\x9fe" ?

        I'm not sure what else is going on, but those strings are equivalent.

        perl -MO=Deparse -e 'print "s\xc3\xbc\xc3\x9fe"'
        perl -MO=Deparse -e 'print "s\x{c3}\x{bc}\x{c3}\x{9f}e"'
        Both parse the same:
        print "s\303\274\303\237e"; -e syntax OK
        I wonder if you have standard output encoded correctly. Maybe adding this will help.
        binmode STDOUT, ':utf8';
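        If the attribute value really is double-encoded, as the original question suggests, one way to treat those codepoints as the UTF-8 bytes they look like is to encode the character string back to single bytes and then decode those bytes as UTF-8. A minimal sketch of that double-decoding repair:

        use strict;
        use warnings;
        use Encode qw(encode decode);

        binmode STDOUT, ':encoding(UTF-8)';

        # The character string as XML::LibXML returns it: codepoints c3 bc c3 9f.
        my $string = "s\x{c3}\x{bc}\x{c3}\x{9f}e";

        # Map each codepoint back to a single byte, then decode those bytes as UTF-8.
        my $fixed = decode('UTF-8', encode('ISO-8859-1', $string));

        printf "%s, %v02x\n", $fixed, $fixed;   # süße, 73.fc.df.65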