rsmah has asked for the wisdom of the Perl Monks concerning the following question:
I ran into a problem using XML::Simple to generate output XML. The input hash was a mix of utf8 and non-utf8 strings. At the last stage, XML::Simple::XMLout joins the components together, and I get corrupted data.
I found this behavior very odd, so I put together a test case that shows join corrupting a non-utf8 string when it is joined with a utf8 string.
At first I thought it might be decoding the non-utf8 string using the locale (or LANG or whatever) to some other encoding, but running this on a LANG=en_US.UTF-8 system produced the same results.
Can anyone explain to me what is going on?
Sample code:
use strict;
use warnings;
no warnings 'utf8';    # silence "Wide character in print"
use Encode qw(decode is_utf8);

# Raw UTF-8 bytes; the UTF8 flag is off.
my $r = "\xc2\xa9\xc2\xae\xe2\x84\xa2";
print "Raw \$r : ", $r,
    " - ", (is_utf8($r) ? "is" : "is not"), " utf8\n";

# The same bytes decoded into a character string; the UTF8 flag is on.
my $u = decode('utf8', "\xc2\xa9\xc2\xae\xe2\x84\xa2");
print "UTF8 \$u : ", $u,
    " - ", (is_utf8($u) ? "is" : "is not"), " utf8\n";

# Joining the byte string with the character string corrupts the bytes.
my $x = join('', $r, $u);
print "Join(\$r, \$u): ", $x,
    " - ", (is_utf8($x) ? "is" : "is not"), " utf8\n";

# Decoding $r first gives a character string that joins cleanly.
my $e = decode('utf8', $r);
print "Encd \$e : ", $e,
    " - ", (is_utf8($e) ? "is" : "is not"), " utf8\n";

my $y = join('', $e, $u);
print "Join(\$e, \$u): ", $y,
    " - ", (is_utf8($y) ? "is" : "is not"), " utf8\n";
Sample Output:
Raw $r : ©®™ - is not utf8
UTF8 $u : ©®™ - is utf8
Join($r, $u): ©®â�¢©®™ - is utf8
Encd $e : ©®™ - is utf8
Join($e, $u): ©®™©®™ - is utf8
Re: Problem with join'ing utf8 and non-utf8 strings (bug?)
by Juerd (Abbot) on Jun 17, 2008 at 18:23 UTC
Hello dear Unicode newbie,
You made one big mistake. Just one, so it's easy to fix. You assumed that you are supposed to look at the SvUTF8 flag, but you're not. It's an internal value, and because it's Perl you're allowed to look at its state. But you really shouldn't, if you want to keep your sanity.
Don't use is_utf8, okay? If you really want to know about internal flags, please use Devel::Peek's Dump function instead. It will print some extra useful internal values too, such as the other flags in Perl like NOK and IOK. For that matter, pretend that the UTF8 flag's name is UOK.
Better yet, pretend that the UTF8 flag does not exist. Perl just picks an encoding for numeric and string values automatically, and only in edge cases (and when you're dealing with internals or XS) do you need to know what is going on.
Read perlunitut and perlunifaq, and realise that you sometimes may need to use Unicode::Semantics (or utf8::upgrade) before text functions operate correctly.
I think it's best if I don't explain what goes on in your code, and if you ignore explanations by others. Trying to understand what's going on internally is a nice exercise for when you know how to write good Unicode capable code, but not before that.
Decode your input, and encode your output. Don't query or set the SvUTF8 flag. Thanks!
Best regards,
Juerd
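Juerd's "decode your input, encode your output" rule can be sketched minimally; the variable names here are mine, not from the thread:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes arriving from outside the program (file, socket, ...),
# here the UTF-8 encoding of the copyright/registered/trademark signs.
my $bytes = "\xc2\xa9\xc2\xae\xe2\x84\xa2";

# Decode once, at the boundary: $text is now 3 characters.
my $text = decode('UTF-8', $bytes);

# Inside the program, work only with character strings.
my $joined = $text . $text;              # 6 characters

# Encode once, on the way out: back to raw bytes.
my $out = encode('UTF-8', $joined);      # 14 bytes
```

With this discipline there is never a byte/character mix in one expression, so the question of what join() does with mixed strings never arises.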
Re: Problem with join'ing utf8 and non-utf8 strings (bug?)
by ikegami (Patriarch) on Jun 17, 2008 at 18:09 UTC
Two mistakes.
-
The first is that you think that $r contains 3 characters.
$r contains 7 characters (equivalently, 7 bytes, since every value is in the 0..255 range).
$u contains 3 characters.
So $x contains 10 (7+3) characters.
When concatenated with a string of characters (is_utf8 == true), bytes are treated as characters.
$e contains 3 characters.
So $y contains 6 (3+3) characters.
-
The second is that you think you're outputting UTF-8.
You're outputting iso-latin-1 characters since you haven't said otherwise. You happen to mix in some UTF-8, but you silenced the message warning you of this problem.
If you want to output something other than iso-latin-1, you can do so by using open (the pragma):
use open qw( :std :locale );
Update: Below is the fixed code (which was modified to output the length of the strings) and the output for a UTF-8 locale.
use open qw( :std :locale );
use Encode qw(decode is_utf8);
$r = "\xc2\xa9\xc2\xae\xe2\x84\xa2";
print "Raw \$r : ", sprintf('%2d', length($r)), " ", $r,
    " - ", (is_utf8($r) ? "is" : "is not"), " utf8\n";
$u = decode('utf8', "\xc2\xa9\xc2\xae\xe2\x84\xa2");
print "UTF8 \$u : ", sprintf('%2d', length($u)), " ", $u,
    " - ", (is_utf8($u) ? "is" : "is not"), " utf8\n";
$x = join('', $r, $u);
print "Join(\$r, \$u): ", sprintf('%2d', length($x)), " ", $x,
    " - ", (is_utf8($x) ? "is" : "is not"), " utf8\n";
$e = decode('utf8', $r);
print "Encd \$e : ", sprintf('%2d', length($e)), " ", $e,
    " - ", (is_utf8($e) ? "is" : "is not"), " utf8\n";
$y = join('', $e, $u);
print "Join(\$e, \$u): ", sprintf('%2d', length($y)), " ", $y,
    " - ", (is_utf8($y) ? "is" : "is not"), " utf8\n";
Raw $r : 7 ©®⢠- is not utf8
UTF8 $u : 3 ©®™ - is utf8
Join($r, $u): 10 ©®⢩®™ - is utf8
Encd $e : 3 ©®™ - is utf8
Join($e, $u): 6 ©®™©®™ - is utf8
Re: Problem with join'ing utf8 and non-utf8 strings (bug?)
by almut (Canon) on Jun 17, 2008 at 17:58 UTC
I think it works as designed. In other words, if you concatenate a unicode/character string with a non-unicode/byte string, the byte string will automatically be upgraded to unicode, with the non-ASCII values being interpreted as if they were in Latin-1 encoding. That is, the first byte \xc2 (Â in Latin-1) becomes Unicode Â (which happens to be codepoint U+00C2) encoded as UTF-8 (i.e. the bytes \xc3\x82), the second byte \xa9 (© in Latin-1) becomes Unicode © (codepoint U+00A9) encoded as UTF-8 (i.e. the bytes \xc2\xa9), etc.
Update: If you print a hexdump of your string $x, e.g.
sub hexdump {
    my $s = shift;
    # note: this peeks at the string's internal byte representation,
    # which is version-dependent behavior
    print join(' ', unpack('(H2)*', $s)), "\n";
}
# ...
hexdump($x);
you'd get
c3 82 c2 a9 c3 82 c2 ae c3 a2 c2 84 c2 a2 c2 a9 c2 ae e2 84 a2
with the first 4 bytes showing the result (UTF-8 encoding) of the conversion I tried to describe above.
Or, fully expanded:
_________ $r (auto-upgraded) _________ ________ $u ________
c2 a9 c2 ae e2 84 a2 c2-a9 c2-ae e2-84-a2
| | | | | | | | | | | | | |
c3-82 c2-a9 c3-82 c2-ae c3-a2 c2-84 c2-a2 c2-a9 c2-ae e2-84-a2
 ©  ® â U0084 ¢ © ® (TM)
Re: Problem with join'ing utf8 and non-utf8 strings (bug?)
by graff (Chancellor) on Jun 18, 2008 at 06:36 UTC
You said:
The input hash was a mix of utf8 and non-utf8 strings. At the last stage, XML::Simple::XMLout joins the components together, and I get corrupted data.
Well, if the "non-utf8 strings" happen to be all ascii characters (ord()<128), then it won't matter, because they are just a proper subset of utf8, and concatenating these with utf8 strings causes no problem.
But if a "non-utf8" string happens to also be "non-ascii", then what would you expect to happen when you concatenate this with a utf8 string? What would you expect to do with the result of such a concatenation? (Hint: unless the answer is something strange and ad-hoc involving pack and unpack, then the real answer is: something incoherent.)
You can't just throw utf8 characters and non-utf8/non-ascii data into a single scalar value and expect to get anything usable. If you combine data this way, the bug you expose is not in perl, but rather in your expectations.
Either keep these data types separate at all times, or else, if the latter type is actually character data in some other encoding, then decode() it into utf8 characters (refer to the Encode module) -- or alternatively, encode() the utf8 string into the same character set as the other data, before concatenating.
UPDATE: Actually, as pointed out by almut, perl's default behavior (interpreting non-ascii/non-utf8 bytes as Latin-1 characters) makes it possible that one of the "more likely" situations -- converting some old single-byte Latin-1 text data to utf8 -- can be handled automatically, and produces a coherent result. It's only when the non-utf8 data is neither ascii nor Latin-1 that the trouble starts.
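graff's suggestion to decode() legacy data before concatenating might look like this minimal sketch (the Latin-1 input is a made-up example, not data from the thread):

```perl
use strict;
use warnings;
use Encode qw(decode);

my $legacy = "\xe9";                       # "e-acute" as a single Latin-1 byte
my $text   = decode('UTF-8', "\xc3\xa9");  # the same letter, already decoded

# Decode the legacy bytes first, then concatenate characters with characters.
my $combined = decode('ISO-8859-1', $legacy) . $text;
# $combined is 2 characters, both U+00E9
```

In this particular case Perl's Latin-1 auto-upgrade would do the same thing implicitly, but spelling out the decode keeps the intent visible and works for encodings other than Latin-1 too.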
Re: Problem with join'ing utf8 and non-utf8 strings (bug?)
by jbert (Priest) on Jun 18, 2008 at 14:26 UTC
In case it's not obvious from what other people have said above:
- Perl is autoconverting your non-tagged string to utf8 for you. In doing so, it assumes the string is already in a particular encoding (iso-latin-1). This assumption is what is at odds with your expectations (you're thinking of the data as a series of utf8 characters, rather than as a series of latin-1 characters).
- Everything should work out OK as long as you ensure the inputs and outputs of your program tag data appropriately. That is, look into 'binmode' to set the :utf8 layer on a filehandle, and/or the 'open' pragma mentioned above, and perhaps the -Cio command-line option.
- Other sources of data can be a pain, e.g. stuff pulled from a db. There are ways around this (see mysql_enable_utf8 in DBD::mysql, and associated charset settings on the db server side).
- The thing to remember is that you don't want a mix of utf8-tagged and non-tagged data loose in your code. The best way to achieve this is to ensure that all data is tagged at the entry points.
- Some CPAN modules just don't seem to play nicely with correctly-tagged utf8 data (e.g. Template::Toolkit requires that you stick a byte-order mark in your templates (ugh) rather than allowing you to tell it an encoding).
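As a small illustration of the encoding layer jbert mentions, here is a sketch that writes to an in-memory filehandle so the produced bytes are easy to inspect; the same ':encoding(UTF-8)' layer can be set on STDOUT with binmode:

```perl
use strict;
use warnings;

# An in-memory filehandle with an encoding layer: characters go in,
# UTF-8 bytes come out, just as with binmode STDOUT, ':encoding(UTF-8)'.
my $buf = '';
open my $fh, '>:encoding(UTF-8)', \$buf or die $!;
print {$fh} "\x{2122}";    # TRADE MARK SIGN, a single character
close $fh;

# $buf now holds the three raw bytes e2 84 a2
```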