xachen has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks, I seek your assistance with my limited Unicode skills. I receive Unicode in the form of, for example, \u6b63, which I convert as:
$_ = "\\x{6b63}";
However, if I print $_, it outputs:
\x{6b63}
I'm not trying to output the "literal string" but the Chinese character, which is 正. Any pointers? The following code executes:
use Encode;
$_ = "\u6b63";
$_ =~ s/\\u(.{4})/chr($1)/eg;
$char = "\\x{$_}";
print "raw is $char \n";
print "decode is " . Encode::decode("unicode", $char). "\n";
__UNDESIRED OUTPUT__
raw is \x{6b63}
decode is \x{6b63}
Thanks, Justin

Replies are listed 'Best First'.
Re: unicode/utf string to actual character
by moritz (Cardinal) on Sep 22, 2009 at 19:14 UTC
    "\u" is a known escape sequence in double-quoted strings (it uppercases the next character), so you need to write "\\u" or '\u' instead:
    use Encode;
    use strict;
    use warnings;
    use 5.010; # only required for say(), not for the encoding stuff
    binmode STDOUT, ':encoding(UTF-8)';
    $_ = "\\u6b63";
    $_ =~ s/\\u(.{4})/chr(hex $1)/eg;
    say $_;
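A slightly stricter variant of the substitution above, offered only as a sketch: `.{4}` accepts any four characters, so input such as `\uzzzz` would be handed to hex, which warns about illegal digits. Restricting the match to hex digits avoids that. The sub name `decode_u_escapes` is hypothetical (it does not appear in the thread), and this sketch makes no attempt to handle surrogate pairs for characters above U+FFFF.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: replace every \uXXXX (exactly four hex digits)
# with the character at that code point. Anything else is left untouched.
sub decode_u_escapes {
    my ($text) = @_;
    $text =~ s/\\u([0-9A-Fa-f]{4})/chr(hex($1))/eg;
    return $text;
}

# Single quotes keep the backslashes literal, so no doubling is needed here.
my $decoded = decode_u_escapes('\u6b63 and \uzzzz');
print "$decoded\n";
```

The malformed `\uzzzz` passes through unchanged, which makes bad input easy to spot in the output.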

    See also: Perl and Unicode.

Re: unicode/utf string to actual character
by zwon (Abbot) on Sep 22, 2009 at 19:10 UTC

    It should be "\x{6b63}" (only one "\")

      This is where I struggle: when I tried this initially, it didn't work and gave back nothing. Results:
      use Encode;
      $_ = "\u6b63";
      $_ =~ s/\\u(.{4})/chr($1)/eg;
      $char = "\x{$_}";
      print "raw is $char \n";
      print "decode is " . Encode::decode("unicode", $char). "\n";
      __FAILED OUTPUT__
      raw is
      decode is
        • String literal "\u6b63" creates the string 6b63 (since \u uppercases the next character). To create the string \u6b63, you need to use string literal "\\u6b63".

        • You can't interpolate into the middle of an escape sequence (like "\x{$_}"). Interpolation and escapes occur at the same level.

        • You're passing the hex representation of a number to chr, but chr expects a number. You can use hex to do the conversion.

        • $_ doesn't contain the hex number of the character as you'd need in the escape sequence. It contains the decoded character already (as returned by chr).

        use open ':std', ':locale'; # So stuff you print shows up right.
        $_ = "\\u6b63";
        print "orig string length is ", length($_), "\n";
        print "orig string is $_\n";
        s/\\u(.{4})/chr(hex($1))/eg;
        print "decoded string length is ", length($_), "\n";
        print "decoded string is $_\n";
        orig string length is 6
        orig string is \u6b63
        decoded string length is 1
        decoded string is 正
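Two of the points in this reply can also be checked in isolation. The following is a minimal sketch, not part of the original reply; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# 1. In a double-quoted literal, \u uppercases the next character, so
#    neither the backslash nor the "u" ends up in the string.
my $lost = "\u6b63";     # "6" has no uppercase form, leaving 6b63
my $kept = "\\u6b63";    # escaped backslash, leaving \u6b63
print length($lost), " ", length($kept), "\n";   # prints "4 6"

# 2. chr expects a number; hex converts the digit string into one.
#    chr(hex(...)) builds at run time what "\x{6b63}" builds at compile time.
my $hex_digits = '6b63';
my $char = chr(hex($hex_digits));
print $char eq "\x{6b63}" ? "same\n" : "different\n";   # prints "same"
```

Because "\x{...}" is resolved when the literal is compiled, a variable cannot supply its digits; chr(hex(...)) is the run-time route to the same character.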
        

        Update: Added lots

Re: unicode/utf string to actual character
by ikegami (Patriarch) on Sep 22, 2009 at 19:56 UTC
Re: unicode/utf string to actual character
by graff (Chancellor) on Sep 23, 2009 at 00:36 UTC