DrkShadow has asked for the wisdom of the Perl Monks concerning the following question:
I've asked elsewhere and was told that this data is Unicode encoded twice, but I don't see that. I've tried decoding from the current format to the desired format, and I get the same data back. I've tried converting from the desired encoding to the original encoding, and... I get the expected data?! But then it doesn't print correctly. I'm just confused. Could someone explain what is happening, and why?
#!/usr/bin/env perl
use warnings;
use strict;
use XML::LibXML;
use Encode qw(encode_utf8 decode_utf8 decode from_to _utf8_off is_utf8);

my $xml = XML::LibXML->new();
$xml->load_ext_dtd(1);
$xml = $xml->load_xml(load_ext_dtd => 0, string => "<?xml version='1.0' encoding=\"ISO-8859-1\" standalone=\"yes\"?><tag string=\"süße\" />");

# The literal hex from the document. In talking with others,
# this is UTF-8 data.
my $desired = "s\xc3\xbc\xc3\x9fe";
my $desired_utf8 = decode_utf8($desired);
printf "Desired: %s;\t%v02x\n", $desired_utf8, $desired_utf8;
# Prints: Desired: s��e; 73.fc.df.65. The
# hex is right, but the string is wrong. ?!

my $temp;
my $tag = $xml->getElementsByTagName('tag');
my $string = ${$tag}[0]->attributes->getNamedItem('string')->nodeValue;
printf "obtained: %s;\t%v02x\n", $string, $string;
printf "\tdec: %s, %v02x\n", decode_utf8($string), decode_utf8($string);
# Prints:
# obtained: süße; 73.c3.bc.c3.9f.65
#     dec: süße; 73.c3.bc.c3.9f.65
# The string is right, but the hex is wrong. ?!. Further,
# why would these two lines be the same, given
# decode_utf8()?!

$temp = $string;
print "\$temp " . (is_utf8($temp) ? 'is' : "isn't") . " tagged as a unicode string.\n";
_utf8_off($temp);
print "\$temp " . (is_utf8($temp) ? 'is' : "isn't") . " tagged as a unicode string.\n";
printf "\tdec: %s, %v02x\n", decode_utf8($temp), decode_utf8($temp);
# Prints:
# $temp is tagged as a unicode string.
# $temp isn't tagged as a unicode string.
#     dec: süße, 73.c3.bc.c3.9f.65
# Why, again, would this line not have assumed octets of
# a UTF8 string?!

$temp = $string;
from_to($temp, 'iso-8859-1', 'utf8');
printf "\tfto: %s, %v02x\n", decode_utf8($temp), decode_utf8($temp);
# An attempt to convert from what the document is supposed
# to be to what I've been informed the string should be.
# Prints:
#     fto: süße, 73.c3.bc.c3.9f.65
# Again, why is no decoding happening?!

$temp = $string;
from_to($temp, 'latin1', 'utf8');
printf "\tfto: %s, %v02x\n", decode_utf8($temp), decode_utf8($temp);
# An attempt to force $temp into being assumed _just_
# octets, so that any unicode flags don't affect the
# output.. just a guess.
# Prints:
#     fto: süße, 73.c3.bc.c3.9f.65
# Again, how could this be the same?

$temp = $string;
from_to($temp, 'utf8', 'ISO-8859-1');
printf "\tfto: %s, %v02x\n", $temp, $temp;
# Prints:
#     fto: s��e, 73.fc.df.65
# The string is _incorrect_ and the decoded hex is
# _correct_! Why??

# The actual use. Select an id from a table in a database
# based on the UTF8 string.
#$getcount = $groupdb->prepare('SELECT id FROM table WHERE UTF8_string=? AND timestamp=?');

#$getcount->execute(decode_utf8($string), $datetime);
#$getcount->bind_columns(\$result);
#$getcount->fetch(); # False

#$getcount->execute(decode('iso-8859-1', $string), $datetime);
#$getcount->bind_columns(\$result);
#$getcount->fetch(); # False

#$getcount->execute(decode_utf8($desired), $datetime);
#$getcount->bind_columns(\$result);
#$getcount->fetch(); # True -- this works.
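For readers following along, here is a minimal sketch of one possible reading of the symptoms, using only Encode and no XML parser. It assumes the octets in the document really are UTF-8 (c3 bc c3 9f) while the XML declaration says ISO-8859-1, so that a parser would decode each octet as its own character. The variable names ($octets, $misread, $fixed) are made up for illustration and are not from the script above. Under that assumption, undoing the damage means encoding back to ISO-8859-1 octets and then decoding those octets as UTF-8. The last lines show why a correct character string can still print as garbage: without an output layer, Perl writes the Latin-1 bytes fc and df, which are not valid UTF-8 on a UTF-8 terminal.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: these are the raw octets found in the document (UTF-8 for "suesse"
# written with u-umlaut and sharp s).
my $octets = "s\xc3\xbc\xc3\x9fe";

# What a decoder does when told the data is ISO-8859-1:
# every octet becomes one character, so we get 6 characters, not 4.
my $misread = decode('ISO-8859-1', $octets);

# Undo the misreading: turn the 6 characters back into the original octets,
# then decode those octets with the encoding they were really written in.
my $fixed = decode('UTF-8', encode('ISO-8859-1', $misread));

# %v02x prints the ordinal of each character, joined by dots.
printf "misread: %v02x (%d chars)\n", $misread, length $misread;
printf "fixed:   %v02x (%d chars)\n", $fixed,   length $fixed;

# Character strings need an output layer; otherwise characters below 256
# are written as single Latin-1 bytes, which a UTF-8 terminal cannot display.
binmode(STDOUT, ':encoding(UTF-8)');
print "fixed as text: $fixed\n";
```

If that assumption holds, the 73.fc.df.65 hex in the output above is the correctly decoded string, and the broken display comes from printing it without an encoding layer rather than from the decoding step.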
Replies are listed 'Best First'.
Re: Encode double encoding?
by Anonymous Monk on May 23, 2014 at 11:11 UTC
by DrkShadow (Initiate) on May 23, 2014 at 12:22 UTC
by McA (Priest) on May 23, 2014 at 13:53 UTC
by farang (Chaplain) on May 23, 2014 at 18:59 UTC
by Anonymous Monk on May 24, 2014 at 03:50 UTC