I ran into a problem when using XML::Simple to generate XML output. The input hash was a mix of utf8 and non-utf8 strings. At the last stage, XML::Simple::XMLout joins the components together and I get corrupted data.

I found this behavior very odd, so I put together a test case that shows join corrupting a non-utf8 string when it is joined with a utf8 string.

At first I thought it might be decoding the non-utf8 string using the locale (or LANG or whatever) to some other encoding, but running this on a LANG=en_US.UTF-8 system produced the same results.

Can anyone explain to me what is going on?

Sample code:

no warnings 'utf8';
use Encode qw(decode is_utf8);

$r = "\xc2\xa9\xc2\xae\xe2\x84\xa2";
print "Raw \$r : ", $r, " - ", (is_utf8($r)?"is":"is not"), " utf8\n";

$u = decode('utf8', "\xc2\xa9\xc2\xae\xe2\x84\xa2");
print "UTF8 \$u : ", $u, " - ", (is_utf8($u)?"is":"is not"), " utf8\n";

$x = join('', $r, $u);
print "Join(\$r, \$u): ", $x, " - ", (is_utf8($x)?"is":"is not"), " utf8\n";

$e = decode('utf8', $r);
print "Encd \$e : ", $e, " - ", (is_utf8($e)?"is":"is not"), " utf8\n";

$y = join('', $e, $u);
print "Join(\$e, \$u): ", $y, " - ", (is_utf8($y)?"is":"is not"), " utf8\n";

Sample Output:

Raw $r : ©®™ - is not utf8
UTF8 $u : ©®™ - is utf8
Join($r, $u): ©®â�¢©®™ - is utf8
Encd $e : ©®™ - is utf8
Join($e, $u): ©®™©®™ - is utf8
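
A smaller check that may help narrow this down. It is only a sketch, and it rests on one assumption that the output above does not prove by itself: when a non-utf8 string is joined with a utf8 string, Perl upgrades the non-utf8 one byte by byte, treating each byte as a Latin-1 code point, regardless of locale. The variable names `$latin1`, `$chars` and `$bytes` are mine, and the script prints only ASCII so terminal encoding does not get in the way.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Same raw UTF-8 bytes as in the test case: 7 bytes, not flagged as utf8.
my $r = "\xc2\xa9\xc2\xae\xe2\x84\xa2";
# Decoded copy: 3 characters, U+00A9 U+00AE U+2122.
my $u = decode('UTF-8', $r);

# Mixed join: the 7 bytes become 7 characters, followed by the 3 from $u.
my $x = join('', $r, $u);
printf "length(\$x) = %d\n", length $x;
printf "first char of \$x = U+%04X\n", ord $x;

# Assumption under test: the implicit upgrade behaves like a Latin-1 decode.
my $latin1 = decode('ISO-8859-1', $r);
print "same as Latin-1 decode: ", ($x eq $latin1 . $u ? "yes" : "no"), "\n";

# Keeping both sides in one representation avoids the mix:
my $chars = join('', decode('UTF-8', $r), $u);   # all characters
my $bytes = join('', $r, encode('UTF-8', $u));   # all bytes
printf "chars: %d, bytes: %d\n", length $chars, length $bytes;
```

If the assumption holds, the join is not damaging the bytes so much as reinterpreting them, which would also explain why changing LANG makes no difference. Either decoding every input before XMLout or encoding everything to bytes first should then give consistent output.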

In reply to Problem with join'ing utf8 and non-utf8 strings (bug?) by rsmah
